Case study · published v1.0 (placeholder; customer name redacted pending sign-off)

An LLM SaaS company moving 40M inference calls / month off OpenAI

The before

~40M chat-completion calls / month split across gpt-4o-mini (the bulk) and gpt-4o (the tail)
Monthly OpenAI bill: ~$54k
Two-region failover via direct OpenAI in us-east + eu-west
No per-call cost attribution back to their own end-customers

The migration

One sprint. Two lines changed in their inference client (base URL + token). Behind a feature flag, ramped 5% → 25% → 100% over six days. Their internal evals (a fixed test set of 1,200 prompts scored by a held-out judge model) showed quality parity within the noise on the model: "auto" setting.

The after

Metric	Before	After
Monthly bill	$54,200	$28,900
p50 latency	620 ms	510 ms
p99 latency	3.4 s	2.9 s
Quality (judge-model score)	0.84	0.83
Per-call carbon attribution	no	yes

What the customer said

"The unlock wasn't the cost, although the cost was great. The unlock was getting the per-call joules so our pricing model could be defensible to our own customers. We can now charge per AI feature based on what it actually costs us to run."

What we learned

The router's default classification was conservative for this customer's traffic — many "looks like reasoning" calls were really template fill-ins. We adjusted the L2 → L1 threshold and saved them another ~$3k/mo.
Per-customer billing attribution turned a perceived cost ceiling into a margin-positive product line. They now upsell a "compliance receipt" tier.