Case study · published v1.0 (placeholder; customer name redacted pending sign-off)
An LLM SaaS company moving 40M inference calls / month off OpenAI
The before
- ~40M chat-completion calls / month split across
gpt-4o-mini(the bulk) andgpt-4o(the tail) - Monthly OpenAI bill: ~$54k
- Two-region failover via direct OpenAI in
us-east+eu-west - No per-call cost attribution back to their own end-customers
The migration
One sprint. Two lines changed in their inference client (base URL + token). Behind a feature flag, ramped 5% → 25% → 100% over six days. Their internal evals (a fixed test set of 1,200 prompts scored by a held-out judge model) showed quality parity within the noise on the model: "auto" setting.
The after
| Metric | Before | After |
|---|---|---|
| Monthly bill | $54,200 | $28,900 |
| p50 latency | 620 ms | 510 ms |
| p99 latency | 3.4 s | 2.9 s |
| Quality (judge-model score) | 0.84 | 0.83 |
| Per-call carbon attribution | no | yes |
What the customer said
"The unlock wasn't the cost, although the cost was great. The unlock was getting the per-call joules so our pricing model could be defensible to our own customers. We can now charge per AI feature based on what it actually costs us to run."
What we learned
- The router's default classification was conservative for this customer's traffic — many "looks like reasoning" calls were really template fill-ins. We adjusted the L2 → L1 threshold and saved them another ~$3k/mo.
- Per-customer billing attribution turned a perceived cost ceiling into a margin-positive product line. They now upsell a "compliance receipt" tier.