The Starting Point: $0.08 per Query

Eighteen months ago, our client was running their AI-powered document analysis product on GPT-4 Turbo. The system prompt was 6,000 tokens. Each query included a 10,000-token document excerpt and generated an average 800-token response. Total: roughly 17,000 tokens per query at $0.01/1k input, $0.03/1k output. The math came out to $0.08 per query. At 500,000 queries per month, that was $40,000/month in inference costs alone โ€” for a product that charged users $299/month. The unit economics didn't work. Over six months, we reduced cost per query to $0.003 โ€” a 96% reduction. The product now runs profitably at scale. Here's the exact playbook.

Step 1: Prompt Compression (Month 1)

The first thing we always do is audit the system prompt for bloat. In this case, 6,000 tokens had grown organically over a year of iteration โ€” adding edge case handling, examples, clarifications. A significant fraction was redundant. We rewrote the system prompt from scratch with the target of expressing the same behavioral specification in fewer tokens. Result: 6,000 tokens to 1,800 tokens. Cost reduction: ~25%. The process isn't just cutting words. We used LLMLingua, a prompt compression tool, to identify which parts of the prompt most influenced model behavior and which were noise. Some teams are surprised to find that 30โ€“40% of their system prompt has negligible effect on outputs. Cutting it saves tokens and often improves consistency.

Step 2: Task Decomposition and Model Routing (Month 2)

The original system processed every query with GPT-4 Turbo. But not every query is equally complex. About 70% of queries were straightforward extraction tasks โ€” pull specific fields from a document, check for presence of a clause, categorize document type. These didn't need frontier model capability. We built a task router that classified queries into three tiers: simple extraction (GPT-4o mini), moderate analysis (GPT-4o), and complex reasoning (GPT-4 Turbo). The router itself runs on GPT-4o mini and costs less than $0.0001 per classification. After routing, 68% of queries ran on GPT-4o mini, 24% on GPT-4o, and 8% on GPT-4 Turbo. Blended cost: roughly 60% reduction from the pre-routing baseline.

"96% cost reduction sounds dramatic, but it's methodical: prompt compression, routing, fine-tuning, caching, and hardware optimization โ€” each step builds on the last."

Step 3: Fine-Tuning for High-Volume Tasks (Months 3โ€“4)

The 68% of queries running on GPT-4o mini were still more expensive than necessary. These extraction tasks had clear, consistent patterns that a smaller fine-tuned model could handle. We collected 8,000 examples of GPT-4o mini inputs and outputs on the high-volume extraction tasks, filtered to examples where the output was verified correct. We fine-tuned Llama 3.1 8B on this dataset. The fine-tuned model matched GPT-4o mini accuracy on 94% of extraction queries and cost $0.0002 per query to run on our own infrastructure vs. $0.003 via API. This is the highest-ROI step in the playbook for teams with stable, high-volume tasks and the infrastructure to run their own models.

Steps 4โ€“5: Semantic Caching and Hardware Optimization (Months 5โ€“6)

By month 5, costs were already down ~85%. The final 11 percentage points came from two sources. Semantic caching: users frequently ask similar but not identical questions. "What are the termination clauses?" and "Summarize the termination provisions" often have identical answers. We built a semantic cache that embeds incoming queries and checks for similar cached responses within a cosine similarity threshold of 0.95. Cache hit rate: 23%. Effective query cost for cache misses is unchanged; cached queries cost only the embedding lookup (~$0.00001). Hardware optimization: we migrated the fine-tuned Llama 3.1 8B deployment from on-demand cloud GPU instances to reserved instances with continuous batching via vLLM. This reduced per-query inference cost by another 40% on the queries running on our infrastructure. Final state: $0.003/query blended across all tiers, routing overhead, and cache hits. At 500,000 queries/month: $1,500/month versus the original $40,000.