The False Binary

Every few months a new paper drops claiming to settle the fine-tuning vs. prompting debate. It never does, because the question itself is malformed. Fine-tuning and prompting are not substitutes. They operate at different layers of the stack and solve different categories of problems. A team that treats them as alternatives is almost certainly underusing one or both.

The framing that actually works: prompting controls the model's behavior on a given inference call. Fine-tuning controls what behaviors are available to prompt. You can't prompt a model into capabilities it doesn't have. You can prompt a capable model into performing worse than it could.

When Prompting Is the Right Tool

Prompting is the right default for most tasks, most of the time. It's fast to iterate, cheap to deploy, and doesn't require labeled data. For tasks where the base model has strong latent capability (summarization, translation, code generation in common languages, structured extraction), a well-crafted prompt will get you 80–90% of the way to fine-tuned performance with no training cost.

Prompting also wins when your task distribution changes frequently. If the queries you're handling today look different from the queries you'll handle in six months, you want to iterate quickly. Fine-tuned models are harder to update without regression risk.

The failure mode is prompt complexity. When your system prompt is 4,000 tokens of edge-case handling, that's usually a signal that you've crossed the threshold where fine-tuning would be more reliable, cheaper per call, and easier to maintain.

When Fine-Tuning Earns Its Cost

Fine-tuning earns its cost in three scenarios: format compliance, capability injection, and cost and latency optimization.

Format compliance: if your application requires consistent structured output (a specific JSON schema, a rigid response template, a controlled tone), fine-tuning is dramatically more reliable than prompting. You can prompt GPT-4 to output JSON all day; it will still occasionally hallucinate fields. A model fine-tuned on 5,000 examples almost never will.

Capability injection: some tasks genuinely require knowledge or skills the base model doesn't have, such as domain-specific medical coding, proprietary classification taxonomies, or internal jargon. Prompting can't inject knowledge that wasn't in pre-training. Fine-tuning can.

Cost and latency optimization: a fine-tuned smaller model often outperforms a prompted larger model at 10–20% of the inference cost. If you're running millions of queries per day, this math matters enormously.
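One way to make the format-compliance comparison concrete is to measure how often a model's raw outputs actually match the expected shape. The sketch below is a minimal illustration using only Python's standard `json` module. The field names in `REQUIRED_FIELDS` and the sample outputs are hypothetical, chosen only for the example; a real application would substitute its own schema and logged outputs.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
# Note that a JSON integer such as 1 parses to int and would fail the float check.
REQUIRED_FIELDS = {"label": str, "confidence": float}


def is_compliant(raw_output: str) -> bool:
    """Return True if raw_output is a JSON object with exactly the expected fields."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    # Reject both missing fields and extra (hallucinated) fields.
    if set(parsed) != set(REQUIRED_FIELDS):
        return False
    return all(isinstance(parsed[name], typ) for name, typ in REQUIRED_FIELDS.items())


def compliance_rate(outputs: list[str]) -> float:
    """Fraction of outputs that pass the schema check (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(is_compliant(o) for o in outputs) / len(outputs)


# Made-up sample outputs, not real model logs.
samples = [
    '{"label": "spam", "confidence": 0.93}',
    '{"label": "spam", "confidence": 0.93, "reason": "looks odd"}',  # extra field
    'Sure! Here is the JSON: {"label": "ham"}',                      # not valid JSON
    '{"label": "ham", "confidence": 0.71}',
]
print(compliance_rate(samples))  # 0.5
```

Running the same check over outputs from a prompted model and a fine-tuned model on identical inputs gives a direct, task-specific comparison rather than relying on a general claim.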

"Prompting controls behavior on a given call. Fine-tuning controls what behaviors are available to prompt. They're not alternatives — they're layers."

The Data Problem

The biggest underestimated cost in fine-tuning is data, not compute. Training a model for a few hours on a modern GPU is cheap. Building a labeled dataset of 2,000–10,000 high-quality examples is not.

We've seen teams underestimate this by an order of magnitude. They budget two weeks for fine-tuning and spend six weeks collecting and cleaning data. The model trains in a day.

Our rule of thumb: if you can't produce 500 high-quality input-output pairs for your target task, don't fine-tune yet. Use prompting, collect real outputs, and curate the best ones into a dataset. Fine-tune once you have signal about what "good" looks like.
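The collect-then-curate loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a prescribed pipeline: the `Example` record, the 1-to-5 `rating` field, and the `min_rating` cutoff are all hypothetical, and only the 500-pair floor comes from the rule of thumb in the text.

```python
from dataclasses import dataclass

MIN_PAIRS = 500  # rule-of-thumb floor from the text


@dataclass(frozen=True)
class Example:
    prompt: str
    output: str
    rating: int  # hypothetical reviewer score, 1 (bad) to 5 (good)


def curate(examples: list[Example], min_rating: int = 4) -> list[Example]:
    """Keep the first sufficiently rated example for each distinct prompt."""
    seen: set[str] = set()
    kept: list[Example] = []
    for ex in examples:
        key = ex.prompt.strip().lower()  # crude normalization for duplicate detection
        if ex.rating >= min_rating and key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept


def ready_to_fine_tune(curated: list[Example], min_pairs: int = MIN_PAIRS) -> bool:
    """True once the curated set reaches the minimum number of pairs."""
    return len(curated) >= min_pairs


# Made-up logged outputs from a prompted model.
logged = [
    Example("Summarize ticket 1", "Short summary.", 5),
    Example("summarize ticket 1 ", "Another summary.", 4),  # duplicate prompt
    Example("Summarize ticket 2", "Rambling output.", 2),   # rating too low
]
curated = curate(logged)
print(len(curated), ready_to_fine_tune(curated))  # 1 False
```

In practice the rating would come from whatever review process defines "good" for the task; the point of the sketch is only that the dataset is a by-product of running the prompted system first.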

Our Decision Framework

We use a four-question rubric with every client before recommending an approach:

1. Do you have high-quality labeled data, or can you generate it cheaply? No → start with prompting.
2. Is the task within the base model's latent capability? Yes → prompting is likely sufficient.
3. Does your system prompt exceed 2,000 tokens of behavioral specification? Yes → evaluate fine-tuning.
4. Are you running >100k queries/day on an expensive model? Yes → the ROI on fine-tuning is almost certain.

The answer is rarely "only one." Our most effective deployments use a fine-tuned model for the high-volume, well-defined core task and a capable prompted model for the long tail of unusual queries.
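The four questions above can be expressed as a small function. The sketch below is one possible encoding: the `TaskProfile` fields and the `recommend` name are hypothetical, and returning every note that applies (rather than a single verdict) is an assumption made to reflect that the answer is rarely "only one." The thresholds are the ones stated in the rubric.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    has_quality_data: bool        # Q1: labeled data exists or is cheap to generate
    within_base_capability: bool  # Q2: task is within the base model's latent capability
    system_prompt_tokens: int     # Q3: size of the behavioral specification
    queries_per_day: int          # Q4: daily volume on an expensive model


def recommend(profile: TaskProfile) -> list[str]:
    """Return every rubric note that applies to the profile, in question order."""
    notes: list[str] = []
    if not profile.has_quality_data:
        notes.append("start with prompting")
    if profile.within_base_capability:
        notes.append("prompting is likely sufficient")
    if profile.system_prompt_tokens > 2_000:
        notes.append("evaluate fine-tuning")
    if profile.queries_per_day > 100_000:
        notes.append("fine-tuning ROI is almost certain")
    return notes


# Made-up profile: data available, capable base model, large prompt, high volume.
profile = TaskProfile(
    has_quality_data=True,
    within_base_capability=True,
    system_prompt_tokens=3_500,
    queries_per_day=250_000,
)
for note in recommend(profile):
    print(note)
```

A profile like this one triggers both a prompting note and two fine-tuning notes, which matches the hybrid deployments described above: fine-tune the high-volume core, keep a prompted model for the rest.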