3× throughput at 60% lower cost
The startup had built a compelling AI product — query accuracy was high, user experience was strong, and they were growing 25% month-over-month. The problem was structural: their entire product ran on GPT-4 API calls, and the economics didn't work at scale. At 4 million queries per day and $0.08 per query, inference cost was $320,000 per day — $9.6M per month. Revenue was growing but costs were growing faster. Gross margin was declining toward a point where additional growth would actively destroy value. Investors had flagged the unit economics as the primary obstacle to the next funding round.
The company had tried the obvious solutions. Switching to GPT-3.5 produced a measurable drop in user-reported quality that showed up in churn data within 3 weeks, so they reverted. Aggressive prompt compression helped marginally, cutting cost by roughly 15%, but did not change the underlying economics. Any cost reduction approach needed to maintain 94%+ quality parity on their benchmark set of 5,000 human-labeled queries. This was a hard constraint, not a guideline: a solution that achieved 90% quality parity would not be acceptable, regardless of cost savings.
"At $0.08 per query and 4M daily queries, inference was consuming 40% of gross revenue and blocking profitability. The unit economics had to change."
The first step was understanding query distribution. We built a classifier to categorize all queries by complexity, domain, and required capability level. The analysis revealed that 68% of daily queries were variations on 12 high-frequency task types (structured extraction, classification, and summarization within narrow domains) that did not require frontier model capability. Based on that finding, we deployed a lightweight routing model (a fine-tuned 1B-parameter classifier) that directs each query to one of three tiers: a fine-tuned small model, GPT-4o (mid-tier), or GPT-4 Turbo (frontier). The routing model costs less than $0.0001 per classification and pays for itself within a few hundred queries.
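The three-tier routing decision described above can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the `Classification` fields, the task-type names, and the two complexity thresholds are hypothetical, and the real router is a fine-tuned classifier whose outputs and decision rule are not described in this case study.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FINE_TUNED = "fine-tuned small model"
    MID = "GPT-4o"
    FRONTIER = "GPT-4 Turbo"


@dataclass(frozen=True)
class Classification:
    task_type: str     # hypothetical label emitted by the routing classifier
    complexity: float  # hypothetical score in [0, 1]; higher means harder


# Hypothetical stand-ins for the 12 high-frequency task types.
HIGH_FREQUENCY_TASKS = {"structured_extraction", "classification", "summarization"}

# Hypothetical thresholds; in practice these would be tuned against the benchmark set.
SMALL_MODEL_MAX_COMPLEXITY = 0.4
MID_TIER_MAX_COMPLEXITY = 0.75


def route(c: Classification) -> Tier:
    """Pick the cheapest tier that is expected to handle the query."""
    if c.task_type in HIGH_FREQUENCY_TASKS and c.complexity <= SMALL_MODEL_MAX_COMPLEXITY:
        return Tier.FINE_TUNED
    if c.complexity <= MID_TIER_MAX_COMPLEXITY:
        return Tier.MID
    return Tier.FRONTIER
```

The design choice worth noting is the fallback order: anything the router is unsure about moves up a tier rather than down, which trades some cost for protection of the quality constraint.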
We fine-tuned three 7B parameter models on the company's high-volume task types, using GPT-4 outputs as training targets. The fine-tuning dataset was built from 90 days of production traffic: input queries, GPT-4 outputs, and human quality labels. Only queries with human quality labels of 4 or 5 out of 5 were included in fine-tuning data. The fine-tuned models were evaluated against the 5,000-query benchmark before any production traffic was routed to them. All three models cleared the 94% quality parity threshold. The fine-tuning compute cost was recovered within 11 days of deployment from the API cost savings.
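The data filter and the pre-deployment gate can be sketched as follows. The label cutoff (4 or 5 out of 5) and the 94% threshold come from the text; the `Record` type and the parity formula are assumptions. In particular, the case study does not say how "quality parity" is computed, so this sketch assumes it is the ratio of total candidate score to total baseline score over the same benchmark queries.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

MIN_LABEL = 4            # from the text: keep only human labels of 4 or 5 out of 5
PARITY_THRESHOLD = 0.94  # from the text: 94% quality parity is a hard constraint


@dataclass(frozen=True)
class Record:
    query: str
    gpt4_output: str
    human_label: int  # 1..5 quality rating of the GPT-4 output


def build_training_set(records: Sequence[Record]) -> List[Tuple[str, str]]:
    """Keep (input, target) pairs whose GPT-4 output was rated 4 or 5."""
    return [(r.query, r.gpt4_output) for r in records if r.human_label >= MIN_LABEL]


def quality_parity(candidate: Sequence[float], baseline: Sequence[float]) -> float:
    """Assumed definition: total candidate score divided by total baseline score."""
    if len(candidate) != len(baseline) or not baseline:
        raise ValueError("score lists must be non-empty and aligned per query")
    baseline_total = sum(baseline)
    if baseline_total <= 0:
        raise ValueError("baseline scores must sum to a positive value")
    return sum(candidate) / baseline_total


def clears_gate(candidate: Sequence[float], baseline: Sequence[float]) -> bool:
    """True only if the candidate model meets the hard parity threshold."""
    return quality_parity(candidate, baseline) >= PARITY_THRESHOLD
```

Running the gate before any production traffic is routed mirrors the order described above: a model that scores 90% parity is rejected no matter how cheap it is to serve.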
For the remaining GPT-4 traffic (queries too complex or novel for fine-tuned models), we implemented speculative decoding with a 1B-parameter draft model from the same family, achieving a 2.7× throughput improvement on the target model infrastructure. The fine-tuned model deployment was migrated to reserved GPU instances with continuous batching via vLLM and INT8 quantization. Quantization was validated against the quality benchmarks: the 0.3% quality degradation it introduced was within acceptable tolerance, and the throughput gain was significant.
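For readers unfamiliar with speculative decoding, the toy sketch below shows the greedy variant of the idea: a cheap draft model proposes several tokens, the target model checks them, and the longest agreeing prefix is accepted. This is not the deployment described above. The `target` and `draft` callables are hypothetical stand-ins for real models, both are assumed to be callable locally, and a production system verifies all proposed tokens in one batched forward pass (and usually uses a sampling-based acceptance rule) rather than looping as this sketch does.

```python
from typing import Callable, List

# A greedy next-token function: takes a token prefix, returns the next token id.
NextToken = Callable[[List[int]], int]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> List[int]:
    """Greedy speculative decoding; output matches plain greedy decoding with `target`."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target model verifies the proposal position by position.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if expected == tok:
                accepted.append(tok)
            else:
                # First disagreement: keep the target's token and discard the rest.
                accepted.append(expected)
                break
        else:
            # Every drafted token was accepted, so the target adds one more if room remains.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens
```

The throughput gain comes from step 2: when the draft agrees with the target often, several tokens are committed per target-model pass instead of one, while the output stays identical to what the target alone would have produced.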
"The fine-tuning compute cost was recovered within 11 days of deployment from API cost savings alone. After that, every day was pure margin improvement."
Blended inference cost dropped from $0.08 to $0.003 per query, a 96% reduction. The breakdown: 68% of queries on fine-tuned self-hosted models at $0.0002/query, 24% on GPT-4o at $0.008/query, 8% on GPT-4 Turbo at $0.028/query, plus routing overhead of $0.0001/query. Quality on the 5,000-query benchmark: 94.2% parity with the original all-GPT-4 baseline. The 5.8% quality gap was concentrated in one query subcategory, complex multi-document synthesis, where the fine-tuned models are not competitive and the routing model correctly sends traffic to GPT-4 Turbo. This subcategory represents 3% of query volume and 8% of remaining cost.
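The blended figure is a traffic-weighted average of the per-tier prices plus the routing overhead, and the sketch below reproduces that arithmetic from the numbers as printed. The function names and the 30-day month are assumptions. One caveat: with the tier shares and prices exactly as listed, the weighted sum comes to about $0.0044 per query, somewhat above the $0.003 headline, so the listed inputs are best read as rounded.

```python
BASELINE_COST = 0.08       # dollars per query, all-GPT-4 baseline
DAILY_QUERIES = 4_000_000
ROUTING_OVERHEAD = 0.0001  # dollars per query

# tier name -> (share of traffic, dollars per query), as listed in the text
TIERS = {
    "fine-tuned self-hosted": (0.68, 0.0002),
    "GPT-4o": (0.24, 0.008),
    "GPT-4 Turbo": (0.08, 0.028),
}


def blended_cost(tiers: dict, routing_overhead: float) -> float:
    """Traffic-weighted cost per query plus the per-query routing overhead."""
    total_share = sum(share for share, _ in tiers.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1")
    return sum(share * cost for share, cost in tiers.values()) + routing_overhead


def monthly_cost(per_query: float, daily_queries: int, days: int = 30) -> float:
    """Monthly spend, assuming a 30-day month and flat daily volume."""
    return per_query * daily_queries * days


if __name__ == "__main__":
    blended = blended_cost(TIERS, ROUTING_OVERHEAD)
    print(f"blended cost per query: ${blended:.6f}")
    print(f"reduction vs baseline:  {1 - blended / BASELINE_COST:.1%}")
    print(f"baseline monthly cost:  ${monthly_cost(BASELINE_COST, DAILY_QUERIES):,.0f}")
```

Because the GPT-4 Turbo tier carries 8% of traffic but roughly half of the weighted cost, small changes in the share of queries routed to the frontier tier move the blended figure more than any other input.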
Monthly inference cost dropped from $9.6M to $360,000 — freeing $9.24M in monthly cash that had been consumed by API costs. Gross margin improved by 37 percentage points. The company completed their Series C at a significantly improved valuation relative to pre-optimization projections. The routing infrastructure also produced a strategic benefit: the company now has a detailed understanding of their query distribution and can make targeted improvements. When a new model releases, they can quickly evaluate it against specific query subcategories rather than running expensive blanket comparisons.