Shipping AI features you can actually trust
There is a real gap between an LLM demo that impresses a room and a feature that survives a quarter in production. The demo uses three curated inputs and a warm cache. Production sees the long tail: malformed documents, ambiguous questions, adversarial users, a retrieval index that has quietly drifted two weeks out of date. If you have not instrumented for that gap, you will ship something that looks like a feature and behaves like a liability.
We have shipped retrieval-augmented systems, agentic workflows, and classification pipelines at scale across financial services and enterprise ops. The pattern that separates the teams who keep shipping from the teams who quietly roll back is not model choice. It is the set of gates they put between a promising prompt and a production deploy.
Gate 1: a retrieval eval that predates the generator
If your system reads from a corpus, and most production LLM features do, retrieval quality is the dominant lever on output quality. A frontier model fed the wrong three chunks will hallucinate confidently. A mid-tier model fed the right three chunks is usually fine.
Build the retrieval eval first, and keep it honest. A set of 150 to 300 representative questions, each with labeled relevant documents, is enough to get started. Score with recall@k, nDCG@10, and a token-budget-aware MRR. Run it on every embedding model change, every chunking change, every index config change. Do not ship a retrieval change that regresses recall@10 by more than an agreed threshold, even if the end-to-end eval looks unchanged. An unchanged end-to-end score only means today’s generator is covering for tomorrow’s regression.
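A minimal version of this gate fits in a few lines of Python. Everything here is illustrative: the `retrieve` callable, the case format, and the `recall_floor` threshold are placeholders for whatever your stack provides, and the nDCG shown assumes binary relevance labels.

```python
import math

def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the labeled-relevant docs that appear in the top k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def ndcg_at_k(retrieved_ids, relevant_ids, k=10):
    """Binary-relevance nDCG: discounted gain normalized by the ideal ordering."""
    relevant = set(relevant_ids)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved_ids[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def run_retrieval_eval(cases, retrieve, k=10, recall_floor=0.85):
    """Score every labeled case; fail the run if mean recall@k drops below the floor."""
    recalls, ndcgs = [], []
    for case in cases:
        retrieved = retrieve(case["question"])
        recalls.append(recall_at_k(retrieved, case["relevant_ids"], k))
        ndcgs.append(ndcg_at_k(retrieved, case["relevant_ids"], k))
    mean_recall = sum(recalls) / len(recalls)
    return {"recall": mean_recall,
            "ndcg": sum(ndcgs) / len(ndcgs),
            "passed": mean_recall >= recall_floor}
```

The `passed` flag is what CI checks: a retrieval change that pushes mean recall below the agreed floor fails the build, regardless of what the end-to-end eval says.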
Gate 2: a task eval that scores what the user actually wants
End-to-end evals are harder than retrieval evals because the correct answer is often a distribution, not a string. Two techniques carry most of the weight.
The first is structured output with hard schema validation. If the feature is “answer the question and cite three sources,” the schema is a JSON object with exactly those fields. A malformed response is a failure. Count it as one. This alone eliminates a class of “sort of worked” outputs that humans grade inconsistently.
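As a sketch, hard schema validation for the “answer plus three sources” example above might look like the following. The field names and the exactly-three rule come from that example, not from any prescribed schema; swap in your own.

```python
import json

# Hypothetical schema for the "answer the question and cite three sources" feature.
REQUIRED_FIELDS = {"answer": str, "sources": list}

def parse_response(raw: str):
    """Parse a model response; any deviation from the schema is one hard failure."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON counts as a failure, never a "sort of worked"
    if not isinstance(obj, dict):
        return None
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(obj.get(field), expected_type):
            return None
    if len(obj["sources"]) != 3:  # "cite three sources" means exactly three
        return None
    return obj
```

The eval then counts `None` results as failures alongside wrong answers, which keeps human graders out of the "was this close enough?" business.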
The second is LLM-as-judge with calibration. A judge model scoring factuality, grounding, and refusal behavior works if, and only if, you periodically audit its scores against a human panel on a held-out slice. We target at least 0.7 Cohen’s kappa against human labels before trusting the judge in CI. Below that, the signal is noise.
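Cohen’s kappa on the audit slice is a short computation: observed agreement corrected for the agreement you would expect by chance. A minimal sketch, assuming two equal-length label lists (for instance pass/fail grades from the judge and the human panel):

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Agreement beyond chance between judge and human labels."""
    assert len(judge_labels) == len(human_labels) and judge_labels
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    # Chance agreement: probability both raters pick the same class independently.
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

The gate is then a one-liner: enable the judge in CI only while `cohens_kappa(...) >= 0.7` holds on the most recent audit.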
Gate 3: a latency and cost budget, enforced in CI
The quiet failure mode of LLM features is not wrong answers. It is a $40,000 monthly bill for a feature that three hundred people use. Cost and latency are quality attributes, and they deserve to be treated that way.
Every prompt change runs through a budget harness: p50 and p95 latency, tokens in, tokens out, cache hit rate, cost per request. Regressions greater than a preset threshold block the PR. This catches the most expensive mistakes early. The well-meaning prompt expansion that adds 2,000 tokens of “context” and quintuples the bill. The chain-of-thought block someone forgot to strip before shipping. The retrieval k that drifted from 5 to 20 during debugging and never went back.
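A budget harness of this shape is small enough to live directly in CI. The sketch below assumes per-request measurement dicts and a stored baseline profile; the 10% regression threshold and the field names are illustrative, not prescriptive.

```python
import statistics

def budget_check(samples, baseline, max_regression=0.10):
    """Compare this PR's latency/cost profile against a stored baseline.

    `samples` is a list of per-request dicts with latency_s, tokens_out,
    and cost_usd keys. Any metric more than max_regression over its
    baseline value is a violation, and any violation blocks the PR.
    """
    latencies = sorted(s["latency_s"] for s in samples)
    profile = {
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],
        "mean_cost_usd": statistics.mean(s["cost_usd"] for s in samples),
        "mean_tokens_out": statistics.mean(s["tokens_out"] for s in samples),
    }
    violations = [metric for metric in profile
                  if profile[metric] > baseline[metric] * (1 + max_regression)]
    return profile, violations
```

The harness replays a fixed request set against the changed prompt, and the CI job fails whenever `violations` is non-empty, which is exactly how the k-drifted-to-20 mistake gets caught before merge.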
Gate 4: observability for the failure modes that matter
Traditional APM tools tell you the request completed in 1.8 seconds. They do not tell you that the model refused to answer because retrieval returned documents in the wrong language, or that output tokens silently truncated mid-JSON, or that a new document in the corpus is poisoning answers for a specific customer segment.
Ship with structured logs on every generation: the user question, the retrieved chunk IDs, the final prompt, the raw output, the parsed output, the judge score (if applicable), the cost. Sample aggressively, retain for at least 30 days, and wire alerts on the distributions: refusal rate, schema failure rate, grounding score, p95 latency. The first week after launch, someone senior reads a stratified sample every day. We have never run this exercise without finding at least one systematic failure the eval missed.
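One way to sketch the per-generation record, with illustrative field names. The refusal check here is deliberately crude; a real system would use a classifier or the judge score, but the shape of the record is the point.

```python
import json
import time
import uuid

def log_generation(question, chunk_ids, prompt, raw_output, parsed_output,
                   judge_score, cost_usd, sink=print):
    """Emit one structured JSON record per generation to a log sink."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "question": question,
        "retrieved_chunk_ids": chunk_ids,
        "prompt": prompt,
        "raw_output": raw_output,
        "parsed_output": parsed_output,          # None on schema failure
        "schema_failed": parsed_output is None,  # alert on this rate
        # Placeholder refusal heuristic; replace with a real classifier.
        "refused": raw_output.strip().lower().startswith("i can"),
        "judge_score": judge_score,
        "cost_usd": cost_usd,
    }
    sink(json.dumps(record))
    return record
```

Because every record carries chunk IDs and the raw prompt, the stratified daily read is a query, not an archaeology project: filter on `schema_failed` or a low `judge_score` and the offending retrieval context is right there.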
What this looks like in aggregate
None of these gates requires exotic tooling. A Postgres table for eval cases, a CI job that runs the harness on every PR, a Grafana board for the production distributions, and a weekly review meeting. The discipline is more boring than the demo. It is also the reason the feature is still running in six months.
If you are considering shipping an AI feature and have not built the retrieval eval, the budget harness, and the observability pipeline, you are building the demo. That is a fine place to start. Just do not confuse it with a product.