How Do AI Engineers Mature Their LLM Eval Systems?

For AI engineers building LLM-powered SaaS products · Based on the Hetzel Eval Maturity Phases Framework

// TL;DR

The Hetzel Eval Maturity Phases Framework gives AI engineers a concrete, four-stage roadmap for evolving LLM evaluation from ad-hoc vibe checking to automated production flywheels. Use it when you're building evals for an LLM-powered SaaS product and need to bridge the gap between 'it seems to work' and 'we can prove it works.' The framework targets known failure modes rather than exhaustive test coverage, validates LLM-as-judge scoring against human ground truth, and establishes a continuous improvement loop powered by real production traces.

Why do most AI engineering teams get stuck between proof-of-concept and production?

The gap between a working demo and a production-ready LLM agent almost always comes down to evaluation. Without structured evals, you can't measure quality, you can't defend your deployment decision, and you can't iterate with confidence. The Hetzel Eval Maturity Phases Framework addresses this directly by giving you a clear progression path.

Most AI engineering teams remain stuck at what the framework calls Level 1 — vibe checking. They run a few examples, eyeball the results, and ship based on gut feel. This works for demos but creates reputational, compliance, and cost risk in production.

How do you apply the four maturity phases as an AI engineer?

Level 1 — Structured Vibe Checking: Run 10–20 representative inputs through your agent. Have a subject matter expert (not just you, the builder) review each output with a thumbs up/down AND a written justification. The justification is the critical artifact — it externalizes domain knowledge you'll need later.
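One way to make the justification a required artifact is to capture each review as a small record. The sketch below is illustrative only: the `Annotation` class, its field names, and the JSON Lines output are assumptions, not something the framework prescribes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Annotation:
    """One SME review of one agent output (field names are hypothetical)."""
    input_text: str
    output_text: str
    thumbs_up: bool
    justification: str
    reviewer: str

    def __post_init__(self) -> None:
        # The written justification is the key artifact, so reject empty ones.
        if not self.justification.strip():
            raise ValueError("a written justification is required")


def to_jsonl(annotations: list[Annotation]) -> str:
    """Serialize reviews as JSON Lines, one review per line."""
    return "\n".join(json.dumps(asdict(a)) for a in annotations)
```

Rejecting an empty justification at construction time keeps a bare thumbs up/down from entering the dataset.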

Level 2 — Measuring to Manage: Feed the collected justifications into a coding assistant to extract and categorize failure modes. For each failure mode, decide: is this deterministic (catchable with code) or subjective (requiring LLM-as-judge)? Build the appropriate scoring function. Then validate your LLM-as-judge by running it against a human-labelled ground truth dataset — this is the 'eval the eval' principle.
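As a rough illustration of the two scorer types and of "eval the eval", the sketch below pairs a deterministic check with a comparison of judge verdicts against human labels. The email-leak failure mode and the function names are hypothetical, the judge verdicts are assumed to be collected already as booleans, and Cohen's kappa is one reasonable agreement metric rather than one the framework names.

```python
import re


def no_email_leak(output: str) -> float:
    """Deterministic scorer: 1.0 if the output contains no email address."""
    return 0.0 if re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", output) else 1.0


def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare LLM-as-judge verdicts with human labels on the same examples."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    # Raw agreement: share of examples where judge and human match.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's pass rate.
    p_h = sum(human) / n
    p_j = sum(judge) / n
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    # Kappa is undefined when chance agreement is 1.0; this sketch reports 0.0.
    kappa = 0.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return {"agreement": observed, "kappa": kappa}
```

A judge with high raw agreement but low kappa may simply be passing almost everything, which is why the chance-corrected number is worth checking alongside the raw one.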

Level 3 — Accounting for Complexity: If your agent makes tool calls, distinguish context-gathering (read-only) tools from CRUD-based tools that create, update, or delete data. Set up mock APIs for the CRUD tools so eval runs never write to production. Embed external system state into your trace payloads so each trace records what the agent saw and what it changed. Evaluate the full trace, including every tool call and intermediate step, not just the final output.
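A minimal sketch of this setup, assuming a hypothetical ticketing tool: an in-memory mock stands in for the write-capable API, every call is recorded, system state is bundled into the trace payload, and a trace-level scorer checks the order of calls. All names here (`MockTicketAPI`, `reads_before_writes`, the ticket fields) are invented for illustration.

```python
import copy
from dataclasses import asdict, dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict
    result: dict


@dataclass
class MockTicketAPI:
    """In-memory stand-in for a write-capable ticket API (hypothetical)."""
    tickets: dict[str, dict] = field(default_factory=dict)
    calls: list[ToolCall] = field(default_factory=list)

    def get_ticket(self, ticket_id: str) -> dict:
        # Read-only, context-gathering tool.
        result = dict(self.tickets.get(ticket_id, {}))
        self.calls.append(ToolCall("get_ticket", {"ticket_id": ticket_id}, result))
        return result

    def update_ticket(self, ticket_id: str, status: str) -> dict:
        # Write tool: mutates only the in-memory state, never production.
        self.tickets.setdefault(ticket_id, {})["status"] = status
        result = {"ok": True}
        args = {"ticket_id": ticket_id, "status": status}
        self.calls.append(ToolCall("update_ticket", args, result))
        return result


WRITE_TOOLS = {"update_ticket"}


def reads_before_writes(calls: list[ToolCall]) -> float:
    """Trace scorer: 1.0 if every write to a ticket follows a read of it."""
    seen: set[str] = set()
    for call in calls:
        ticket_id = call.args["ticket_id"]
        if call.name in WRITE_TOOLS and ticket_id not in seen:
            return 0.0
        if call.name == "get_ticket":
            seen.add(ticket_id)
    return 1.0


def trace_payload(api: MockTicketAPI, initial_state: dict) -> dict:
    """Bundle the tool calls with system state before and after the run."""
    return {
        "initial_state": initial_state,
        "final_state": copy.deepcopy(api.tickets),
        "calls": [asdict(c) for c in api.calls],
    }
```

Because the scorer reads the recorded calls rather than the final answer, it can flag an agent that reached the right outcome by writing blindly.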

Level 4 — Advanced Techniques: Run topic modelling across production traces to surface failure modes you didn't anticipate. Automate eval pipeline execution via CLI. At this stage, your eval system is a continuous production improvement engine.
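The sketch below gestures at both ideas using only the standard library. `top_terms` is a deliberately crude stand-in for topic modelling (a term count over failure notes, not a real topic model), and `main` is a hypothetical CLI entry point that gates on a mean score. The JSONL layout, the `--min-mean` flag, and the 0.8 default are all assumptions.

```python
import argparse
import json
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "to", "of", "and", "in", "for", "is", "was", "it", "not"}


def top_terms(notes: list[str], k: int = 5) -> list[tuple[str, int]]:
    """Most frequent non-stopword terms, counted once per note."""
    counts = Counter(
        word
        for note in notes
        for word in set(re.findall(r"[a-z]+", note.lower()))
        if word not in STOPWORDS
    )
    return counts.most_common(k)


def main(argv: list[str]) -> int:
    """Return 0 if the mean score meets the threshold, 1 if not, 2 if empty."""
    parser = argparse.ArgumentParser(description="Gate a build on eval results.")
    parser.add_argument("results", help="JSONL file with a numeric 'score' per line")
    parser.add_argument("--min-mean", type=float, default=0.8)
    args = parser.parse_args(argv)
    with open(args.results, encoding="utf-8") as fh:
        scores = [json.loads(line)["score"] for line in fh if line.strip()]
    if not scores:
        print("no scores found")
        return 2
    mean = sum(scores) / len(scores)
    print(f"mean score {mean:.3f} over {len(scores)} examples")
    return 0 if mean >= args.min_mean else 1
```

Calling `sys.exit(main(sys.argv[1:]))` from a script entry point would let a CI job fail the build when the return code is non-zero.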

What mistakes do AI engineers make most often with LLM evals?

The most common mistake is treating evals like unit tests — trying to cover every possible failure scenario. LLM failure spaces are infinite; targeting known, high-priority failure modes is the correct approach. Second, teams collect thumbs up/down without justifications, losing the domain knowledge that makes LLM-as-judge prompts effective. Third, teams trust LLM-as-judge scores without validation — putting a robe and a cloak on an LLM doesn't make it inherently trustworthy.

Another critical mistake: using only synthetic examples instead of real production or UAT traces. Your eval dataset should approximate rerunning production, not running abstract hypotheticals.

How does the eval flywheel accelerate iteration speed?

The flywheel is what turns evals from a defensive checkpoint into an offensive improvement tool. The loop: capture agent traces in production → surface failures via human review or automated tooling → pull failing examples into your offline eval dataset → rerun evals → use results to guide your next improvement → measure the impact → repeat.
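The dataset-growing step of the loop can be sketched as follows, assuming each trace is a dict with an `input` and a `failed` flag set by human review or automated tooling. The field names and helper functions are hypothetical, and the expected output is left empty for a reviewer to fill in.

```python
import hashlib
import json


def example_key(example: dict) -> str:
    """Stable key derived from the example's input, used to avoid duplicates."""
    payload = json.dumps(example["input"], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def add_failures(dataset: list[dict], traces: list[dict]) -> list[dict]:
    """Return a new dataset with each failing production trace appended once."""
    seen = {example_key(e) for e in dataset}
    merged = list(dataset)
    for trace in traces:
        key = example_key(trace)
        if trace.get("failed") and key not in seen:
            # 'expected' stays empty until a reviewer supplies the right answer.
            merged.append({"input": trace["input"], "expected": None, "source": "production"})
            seen.add(key)
    return merged


def pass_rate(scores: list[float], threshold: float = 0.5) -> float:
    """Share of scores at or above the threshold, for before/after comparison."""
    if not scores:
        raise ValueError("no scores to summarize")
    return sum(s >= threshold for s in scores) / len(scores)
```

Comparing `pass_rate` on the same merged dataset before and after an agent change gives the "measure the impact" step a concrete number.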

This means every production failure becomes fuel for measurable improvement. Every agent change is validated against real-world conditions. You stop guessing and start proving.

What's the next step?

Identify your current maturity level honestly. If you're at Level 1, start today by having a domain expert review 10 agent outputs with written justifications. If you're at Level 2, build your first LLM-as-judge scoring function and validate it against human ground truth. The framework is designed for incremental progress — start where you are and advance one level at a time.

// FREQUENTLY ASKED QUESTIONS

How many eval examples do I need before shipping my LLM product to production?

Start with 10–20 representative examples, each reviewed by a subject matter expert who records a written justification. For production readiness, aim for enough examples to cover each of your top 3–5 identified failure modes. Real production or UAT traces are far more valuable than synthetic examples. The framework emphasizes starting imperfect and growing your dataset through the flywheel rather than waiting for a perfect dataset.
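A quick way to see where the dataset is thin is to count examples per failure mode. This is a sketch under assumptions: each example is assumed to carry a `failure_modes` list of tags, and the minimum of 5 per mode is an arbitrary placeholder, since the framework gives no fixed number.

```python
from collections import Counter


def coverage_gaps(
    examples: list[dict],
    failure_modes: list[str],
    min_per_mode: int = 5,  # arbitrary placeholder, not a framework value
) -> dict[str, int]:
    """Return failure modes with fewer than min_per_mode examples, with counts."""
    counts = Counter(
        tag for example in examples for tag in example.get("failure_modes", [])
    )
    return {mode: counts[mode] for mode in failure_modes if counts[mode] < min_per_mode}
```

An empty result means every listed failure mode has at least the chosen minimum number of tagged examples.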

Can I use the Hetzel framework with any LLM eval platform?

Yes, the framework is platform-agnostic. While Phil Hetzel developed it at Braintrust, its principles (structured annotation with justifications, failure mode derivation, LLM-as-judge validation, production trace datasets, and the flywheel) apply regardless of your tooling. You need trace capture, scoring function execution, and dataset management capabilities, but you can get these from various platforms or from custom tooling.

How do I convince my team that vibe checking isn't enough for production?

Frame it in terms of risk. Vibe checking gives you no measurable quality metrics, no regression detection, and no defensible basis for production decisions. The Hetzel framework provides a concrete path: documented justifications from vibe checks become derived failure modes, which become scoring functions, which produce quantifiable quality scores. Show that the first step beyond vibe checking — collecting justifications — adds minimal overhead but unlocks the entire maturity progression.