Why Benchmarks Fail You

MMLU measures knowledge breadth across 57 academic subjects. HumanEval measures code generation on algorithmic problems. HellaSwag measures common-sense reasoning with sentence completions. None of these tell you how a model will handle your users' actual queries, in your actual product, with your actual system prompt. The benchmark problem is distribution mismatch. Benchmark tasks are designed to be measurable, not realistic. They optimize for clean evaluation over ecological validity. The result: a model that leads every benchmark leaderboard may underperform a model ranked below it on the task you actually care about. We've seen this happen repeatedly.

Building a Task-Specific Eval Set

The first evaluation infrastructure we build for every client is a golden set: 200–500 real queries from production or user research, with human-labeled ideal responses. This sounds expensive. It isn't, relative to the cost of shipping a bad model. A skilled annotator can label 50–100 examples per day. A golden set of 300 examples, built carefully, is worth more than any public benchmark for predicting production performance. The key is diversity. A golden set that covers only the happy path will tell you nothing about failure modes. We use stratified sampling: ~60% common cases, ~25% edge cases, ~15% adversarial queries (questions designed to elicit hallucination, refusal, or format violations).
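As a rough illustration of the stratified sampling described above, here is a minimal sketch. It assumes the candidate queries have already been sorted into three pools by a human or a classifier; the pool names, the `build_golden_set` function, and the rounding rule are illustrative assumptions, not a description of any particular tooling.

```python
import random

# Target mix taken from the text: ~60% common, ~25% edge, ~15% adversarial.
STRATA_WEIGHTS = {"common": 0.60, "edge": 0.25, "adversarial": 0.15}


def build_golden_set(pools, total, weights=STRATA_WEIGHTS, seed=0):
    """Draw `total` queries from per-stratum pools according to `weights`.

    `pools` maps a stratum name to a list of candidate queries (hypothetical
    input; how queries get assigned to strata is out of scope here).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    # Allocate a count per stratum, then push any rounding remainder
    # onto the largest stratum so the counts sum to `total`.
    counts = {name: round(total * w) for name, w in weights.items()}
    largest = max(weights, key=weights.get)
    counts[largest] += total - sum(counts.values())

    sample = []
    for name, k in counts.items():
        pool = pools[name]
        if len(pool) < k:
            raise ValueError(f"stratum {name!r} has {len(pool)} queries, need {k}")
        # Sample without replacement within each stratum.
        sample.extend({"stratum": name, "query": q} for q in rng.sample(pool, k))

    rng.shuffle(sample)  # avoid presenting annotators with one stratum at a time
    return sample
```

The sampled queries still need human-labeled ideal responses; this step only decides which queries get labeled.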

Automated Eval at Scale

Human evaluation doesn't scale. Once you have a golden set with human-labeled responses, you need automated eval to run on every model update, every prompt change, every fine-tuning run. We use a three-layer automated eval stack: exact match for structured outputs (JSON schema compliance, required field presence), semantic similarity for prose outputs (cosine similarity against the golden response using an embedding model), and LLM-as-judge for quality dimensions that resist rule-based scoring (helpfulness, tone, factual accuracy). LLM-as-judge has a bad reputation because it's often implemented naively. We've found it's reliable when: the judge model is more capable than the evaluated model, the rubric is concrete and specific (not "rate quality 1-10"), and judgments are calibrated against human labels on a subset of the eval set.
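A minimal sketch of how the three layers might be wired together follows. The `embed` and `judge` callables are placeholders for whatever embedding model and judge model you use, and the example fields (`kind`, `required_fields`, `golden`, `query`, `rubric`) are assumed names. The structured check below only covers JSON parsing and required-field presence, not full schema validation.

```python
import json
import math


def check_structured(output, required_fields):
    """Layer 1: output must parse as a JSON object with every required field."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(f in parsed for f in required_fields)


def cosine_similarity(a, b):
    """Layer 2 helper: cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def evaluate(example, output, embed, judge):
    """Score one model output against one golden-set example.

    `embed(text) -> list[float]` and `judge(query, output, rubric) -> float`
    are hypothetical callables supplied by the caller.
    """
    scores = {}
    if example["kind"] == "structured":
        ok = check_structured(output, example["required_fields"])
        scores["exact"] = 1.0 if ok else 0.0
    else:
        scores["similarity"] = cosine_similarity(
            embed(output), embed(example["golden"])
        )
    # Layer 3: the judge sees a concrete rubric, not a bare "rate 1-10" prompt.
    scores["judge"] = judge(example["query"], output, example["rubric"])
    return scores
```

Calibrating the judge then amounts to comparing the `judge` scores with human labels on a held-out slice of the golden set before trusting them on the rest.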

"A golden set of 300 carefully curated examples from your actual users is worth more than every public benchmark for predicting whether your model will work in production."

Regression Testing and CI

Evaluation isn't a one-time event; it's a continuous process that needs to integrate with your deployment pipeline. We build model evaluation into CI/CD: every pull request that touches a system prompt, model version, or fine-tuning dataset triggers an eval run against the golden set. Regressions block deployment. The threshold is set conservatively: a 1% drop in eval score on any dimension requires human review before merge. This sounds bureaucratic. It isn't. The alternative is discovering regressions in production, through user complaints, three weeks after a prompt change that seemed minor.
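The gate itself can be very small. The sketch below assumes per-dimension scores on a 0 to 1 scale and reads "a 1% drop" as an absolute drop of 0.01; if your scores are on another scale, or you mean a relative drop, the comparison needs adjusting. The function names are illustrative.

```python
def find_regressions(baseline, candidate, max_drop=0.01):
    """Return {dimension: drop} for each dimension that fell by at least max_drop."""
    regressions = {}
    for dim, base_score in baseline.items():
        # A dimension missing from the candidate run is treated as a score of 0.0,
        # so a silently skipped metric fails the gate instead of passing it.
        drop = base_score - candidate.get(dim, 0.0)
        if drop >= max_drop:
            regressions[dim] = round(drop, 4)
    return regressions


def gate_exit_code(baseline, candidate, max_drop=0.01):
    """Exit code for a CI step: 0 allows the merge, 1 blocks it pending review."""
    regressions = find_regressions(baseline, candidate, max_drop)
    for dim, drop in sorted(regressions.items()):
        print(f"REGRESSION {dim}: dropped {drop:.4f}")
    return 1 if regressions else 0
```

A CI job would load the baseline scores from the main branch, run the eval on the pull request, and pass the result of `gate_exit_code` to `sys.exit`.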

Monitoring in Production

Offline evaluation tells you what your model does in controlled conditions. Production monitoring tells you what it actually does. We instrument every production deployment with: output length distribution (sudden changes often signal prompt regression), refusal rate (should be stable; spikes indicate distribution shift or adversarial probing), latency percentiles (p50 and p99), and a 1–5% sample sent to LLM-as-judge for ongoing quality scoring. The monitoring stack alerts on statistical anomalies, not fixed thresholds. What's normal for one deployment is pathological for another. Anomaly detection on rolling baselines catches real problems; fixed threshold alerts generate noise.
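One simple way to alert on a rolling baseline rather than a fixed threshold is a z-score against a sliding window of recent values. This is only a sketch of that idea: the window size, the z-score cutoff, and the `RollingAnomalyDetector` class are assumptions for illustration, and a production system might prefer a more robust statistic.

```python
import statistics
from collections import deque


class RollingAnomalyDetector:
    """Flag values that deviate sharply from a rolling window of recent values."""

    def __init__(self, window=100, z_threshold=3.0, min_samples=10):
        self.values = deque(maxlen=window)  # oldest values fall off automatically
        self.z_threshold = z_threshold
        self.min_samples = min_samples

    def observe(self, value):
        """Return True if `value` is anomalous, then add it to the baseline."""
        anomalous = False
        if len(self.values) >= self.min_samples:
            mean = statistics.fmean(self.values)
            std = statistics.pstdev(self.values)
            if std > 0:
                anomalous = abs(value - mean) / std > self.z_threshold
            else:
                # A perfectly flat baseline: any change at all is a deviation.
                anomalous = value != mean
        self.values.append(value)
        return anomalous


# Usage: one detector per metric per deployment, fed e.g. an hourly refusal rate.
detector = RollingAnomalyDetector(window=48)
```

Because each deployment gets its own detector, a refusal rate that is normal for one product does not trigger alerts tuned for another. Note that this sketch also adds anomalous values to the baseline, so a sustained shift eventually becomes the new normal.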