Hetzel Eval Maturity Phases Framework

Last updated: 27 May 2026

Apply a structured, stage-by-stage methodology to design and mature your LLM/agent evaluation system — from first vibes to production flywheels — so your agent reaches production with measurable, defensible quality.

// TL;DR

The Hetzel Eval Maturity Phases Framework is a four-stage methodology for designing and maturing your LLM or AI agent evaluation system — from informal vibe checking through structured human annotation, LLM-as-judge scoring, and production flywheel automation. Use it whenever you're building or improving evals for an AI-powered application, especially when you're stuck in proof-of-concept and need a clear path to production-grade, measurable quality. It emphasizes targeting known failure modes over exhaustive testing, extracting domain knowledge from human annotators before automating, and always validating your LLM-as-judge against human ground truth.

// When should I use the Hetzel Eval Maturity Phases Framework?

Use this skill whenever you are building or improving an evaluation (eval) system for an AI agent or LLM-powered application. It is particularly valuable when you are stuck in proof-of-concept and cannot bridge to production, or when your current eval approach is ad hoc and you need to scale or add structure.

// What inputs do I need to apply the Hetzel Eval Maturity Phases Framework?

Agent or prompt under test (required)
A description of the agent, prompt, or workflow you want to evaluate — what it does, its domain, and its intended users.
Known or suspected failure modes
Any failure modes already identified by you or a subject matter expert: wrong answers, unsafe outputs, excessive cost, compliance risks, etc.
Current maturity level (required)
Where you are today: vibe checking only, running human annotation, using LLM-as-judge, or dealing with multi-tool/external-system complexity.
Available data
Whether you have production traces, UAT traces, or only synthetic/hand-crafted examples to draw from for your dataset.
External system dependencies
Any CRUD-based or context-gathering tool calls, APIs, vector databases, or external systems the agent interacts with.

// What are the core principles behind the Hetzel Eval Maturity Phases Framework?

Evals are not unit tests

Do not attempt to exhaustively cover every possible failure path — that is infinite and unproductive. Instead, start high-level with the known failure modes of your agent, identified by you or a subject matter expert, and build evals specifically around those.

Eval results don't need to be perfect

Especially when using LLM-as-judge techniques, 100% accuracy is not the goal. Results can be directional. As long as scores are trending in the right direction as you iterate, that is completely fine.
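
As a rough illustration of what "directional" can mean in practice, the sketch below averages per-example scores for successive eval runs and checks whether the averages move upward. The run names, the scores, and the `is_trending_up` helper are hypothetical and not part of the framework itself.

```python
from statistics import mean


def mean_scores(runs: dict[str, list[float]]) -> dict[str, float]:
    """Average the per-example scores of each eval run."""
    return {name: mean(scores) for name, scores in runs.items()}


def is_trending_up(averages: list[float], tolerance: float = 0.0) -> bool:
    """True if no run's average drops more than `tolerance` below the previous run."""
    return all(later >= earlier - tolerance
               for earlier, later in zip(averages, averages[1:]))


# Hypothetical judge scores (0.0 to 1.0) for three iterations of the same agent.
runs = {
    "v1": [0.4, 0.6, 0.5, 0.5],
    "v2": [0.6, 0.6, 0.5, 0.7],
    "v3": [0.7, 0.8, 0.6, 0.7],
}
averages = mean_scores(runs)
print(averages)
print(is_trending_up(list(averages.values())))
```

The `tolerance` parameter is one way to absorb judge noise: a small dip between runs does not have to count as a regression.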

Think about evals as rerunning production

The gold standard for an eval dataset is real production traces or at least UAT-level traces. The goal is not to run abstract tests — it is to replay what actually happens in production so you can remain confident in your agent.

The Flywheel

Capture agent traces in production, identify what is going wrong (via human or automated tooling), bring those examples back into an offline experimentation environment, rerun production through an eval, and use the results to guide the next improvement. This is playing offense with your evals.
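
One turn of this loop could be sketched as follows, assuming traces are already captured and flagged by some upstream review step. The `Trace` fields and the `pull_failures` helper are illustrative names, not an API from any particular platform.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    trace_id: str
    input: str
    output: str
    flagged: bool = False                          # set by human review or automated tooling
    context: dict = field(default_factory=dict)    # system state at capture time


def pull_failures(traces: list[Trace], dataset: dict[str, Trace]) -> int:
    """Copy flagged production traces into the offline eval dataset.

    Returns the number of traces that were not already in the dataset.
    """
    added = 0
    for trace in traces:
        if trace.flagged and trace.trace_id not in dataset:
            dataset[trace.trace_id] = trace
            added += 1
    return added


# Hypothetical production traces; two were flagged as failures.
production = [
    Trace("t1", "Reset my password", "Done.", flagged=False),
    Trace("t2", "Cancel my plan", "I upgraded your plan.", flagged=True),
    Trace("t3", "Export my data", "We do not support exports.", flagged=True),
]
eval_dataset: dict[str, Trace] = {}
print(pull_failures(production, eval_dataset))
```

Keying the dataset by trace ID keeps repeated pulls idempotent, so the same production failure is not added twice.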

Extract domain knowledge before you automate

Human annotators hold critical domain-specific knowledge about what quality looks like. You must extract that knowledge — through justifications, not just thumbs up/down — before you can scale it through LLM-as-judge or automated scoring.

Eval the eval

Just because you put a robe and a cloak on an LLM does not make it inherently more trustworthy as a judge. LLM-as-judge outputs must themselves be evaluated against a ground truth dataset to confirm they align with what a human expert would decide.

Evals serve agent quality

Everything in the eval system exists wholly in service to agent quality — managing reputational risk, systems cost risk, compliance risk, and enabling confident iteration. Keep this as the north star when deciding what to measure.

// How do you apply the Hetzel Eval Maturity Phases Framework step by step?

1. Identify your current maturity level and failure modes
Before designing any eval, locate yourself on the maturity continuum: (1) Just Getting Started / Vibe Checking, (2) Measuring to Manage, (3) Accounting for Complexity, (4) Advanced Techniques. Then list the agent's known failure modes with a subject matter expert. These are your primary eval targets, not an exhaustive test list.
2. Run structured vibe checking with documented human annotation
If at Level 1: give the agent at least 10 example inputs and have a human (ideally a subject matter expert, not just the builder) review outputs. For each output, record: (a) a thumbs up or thumbs down, AND (b) a written justification explaining why. The justification is more important than the verdict because it externalises domain knowledge you will need later. Do not skip the justification step.
3. Derive failure modes from annotation justifications
At Level 2: feed the collected justifications into a coding assistant (e.g. Cursor, Claude Code, Codex) to systematically extract and categorise the failure modes embedded in those thumbs-down justifications. The output is a structured list of failure modes your agent actually exhibits.
4. Build scoring functions: deterministic and/or LLM-as-judge
For each failure mode, decide: can this be caught deterministically (e.g. too many tool calls, excessive token usage, format errors)? If yes, write a code-based scoring function. If the failure mode is subjective or nuanced, implement an LLM-as-judge scoring function using the justification language from Step 2 as the basis for the judge prompt. Critically, you must then eval your LLM-as-judge outputs against a human ground truth dataset; do not trust them blindly.
5. Build and populate your eval dataset from production or UAT traces
Stop using only hand-crafted synthetic examples. Capture real production traces or UAT-level traces and add them to your eval dataset. The dataset should contain the inputs that initiate the task plus enough context to represent the system state at the time the trace was created. Think of this dataset as a snapshot of production, not a test suite.
6. Activate the Flywheel: close the loop between production and offline experimentation
Set up a continuous process: (1) capture agent traces in production, (2) surface failures via human review or automated tooling, (3) pull failing examples into your offline eval dataset, (4) rerun evals, (5) use results to guide your next agent improvement. This is the shift from defensive evals to playing offense.
7. Account for external system complexity: tool calls and state representation
At Level 3, if your agent makes tool calls: distinguish context-gathering tools (read-only data injection) from CRUD-based tools (create/read/update/delete on external systems). For CRUD tools: avoid writing to production systems during eval runs. Represent external system state inside the trace itself by embedding system state context into the trace payload. Use timestamp/version queries on systems that support them (e.g. vector databases) to replay the state that existed when the original trace was captured. Use mock APIs to approximate real production environments where full state replay is not possible.
8. Evaluate at the full trace level, not just the final output
For multi-step agents, you cannot only score the final response. You must instrument and capture the entire trace (every tool call, every intermediate step) and target scoring functions at individual steps, including individual tool or MCP calls. You need platform tooling capable of ingesting and querying arbitrarily large traces.
9. Scale failure mode discovery with topic modelling at production volume
At Level 4 (Advanced): rather than relying solely on manual failure mode identification, run topic modelling across production traces at scale to automatically surface emerging failure modes you did not anticipate. Combine with automated eval pipeline execution via CLI tooling to run evals in a fully automated, continuous manner.
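
The deterministic scoring and trace-level evaluation steps above can be sketched together. This is a minimal example under assumed data shapes: the `Step` record, the budget thresholds, and the function names are all hypothetical, and a real trace format will depend on your tooling.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str            # "llm" or "tool_call"
    name: str
    tokens: int = 0
    error: bool = False


def score_tool_call_budget(trace: list[Step], max_calls: int = 5) -> float:
    """Return 1.0 if the trace stays within the tool-call budget, else 0.0."""
    calls = sum(1 for step in trace if step.kind == "tool_call")
    return 1.0 if calls <= max_calls else 0.0


def score_token_budget(trace: list[Step], max_tokens: int = 4000) -> float:
    """Return 1.0 if total token usage across all steps is within budget, else 0.0."""
    return 1.0 if sum(step.tokens for step in trace) <= max_tokens else 0.0


def score_each_tool_call(trace: list[Step]) -> list[tuple[str, float]]:
    """Score every tool call individually: 1.0 if it succeeded, 0.0 if it errored."""
    return [(step.name, 0.0 if step.error else 1.0)
            for step in trace if step.kind == "tool_call"]


# Hypothetical four-step trace with one failing tool call.
trace = [
    Step("llm", "plan", tokens=900),
    Step("tool_call", "search_docs", tokens=300),
    Step("tool_call", "create_ticket", tokens=200, error=True),
    Step("llm", "final_answer", tokens=700),
]
print(score_tool_call_budget(trace))
print(score_token_budget(trace))
print(score_each_tool_call(trace))
```

Scoring per step rather than per trace makes it visible that `create_ticket` failed even when the final answer looks acceptable.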

// What are real-world examples of applying the Hetzel Eval Maturity Phases Framework?

A team has built a customer support agent for a SaaS product. They have been manually reviewing outputs informally but have no structured eval process. They want to know if it is ready for production.

They are at Level 1. The first action is to select 10–20 representative customer queries and run them through the agent. A customer support subject matter expert — not the engineer — reviews each output and records a thumbs up/down plus a written justification (e.g. 'Thumbs down — agent hallucinated a feature that does not exist' or 'Thumbs up — correctly escalated to human agent'). These justifications become the raw material for deriving failure modes in the next step, which then get encoded as LLM-as-judge scoring functions to scale evaluation beyond what the SME can manually review.
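
A lightweight way to enforce "verdict plus justification" is to make the justification a required field of the annotation record. The sketch below assumes a simple in-memory structure; the `Annotation` class, its field names, and the sample queries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    query: str
    agent_output: str
    thumbs_up: bool
    justification: str

    def __post_init__(self) -> None:
        # Reject verdict-only annotations: the justification is the critical output.
        if not self.justification.strip():
            raise ValueError("a written justification is required")


def thumbs_down_justifications(annotations: list[Annotation]) -> list[str]:
    """Collect the justifications that will later be mined for failure modes."""
    return [a.justification for a in annotations if not a.thumbs_up]


annotations = [
    Annotation(
        query="Does the Pro plan include SSO?",
        agent_output="Yes, SSO is included in Pro.",
        thumbs_up=False,
        justification="Agent hallucinated a feature that does not exist",
    ),
    Annotation(
        query="I was double charged and need a refund now.",
        agent_output="I am connecting you with a billing specialist.",
        thumbs_up=True,
        justification="Correctly escalated to human agent",
    ),
]
print(thumbs_down_justifications(annotations))
```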

A team has an agent that books calendar appointments by calling an external scheduling API. Their eval runs keep accidentally creating real calendar entries in production.

This is a Level 3 CRUD-based tool call problem. The team should: (1) set up a mock scheduling API that accepts calls without writing real data, (2) embed the relevant external system state (e.g. which calendar slots were available at the time of the original trace) directly into the captured trace payload, and (3) configure their eval task to read that embedded state rather than query the live system. If the scheduling API supports timestamp or version queries, use them to replay the slot availability as it existed when the original production trace was captured.
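
A minimal sketch of that setup follows, assuming the trace payload carries a `system_state` entry with the slots that were free at capture time. `MockSchedulingAPI`, `api_from_trace`, and the payload keys are invented for illustration and do not correspond to any real scheduling service.

```python
class MockSchedulingAPI:
    """Stand-in for the scheduling API during eval runs; nothing is written to production."""

    def __init__(self, available_slots: list[str]) -> None:
        self._available = set(available_slots)
        self.bookings: list[str] = []   # kept in memory so scorers can inspect them

    def list_slots(self) -> list[str]:
        return sorted(self._available)

    def book(self, slot: str) -> dict:
        if slot not in self._available:
            return {"ok": False, "error": "slot_unavailable"}
        self._available.remove(slot)
        self.bookings.append(slot)
        return {"ok": True, "slot": slot}


def api_from_trace(trace: dict) -> MockSchedulingAPI:
    """Rebuild the external system state that was embedded in the captured trace."""
    return MockSchedulingAPI(trace["system_state"]["available_slots"])


# Hypothetical captured trace with the slot availability at capture time.
trace = {
    "input": "Book me 30 minutes with Dana on Tuesday morning",
    "system_state": {"available_slots": ["2026-06-02T09:00", "2026-06-02T10:30"]},
}
api = api_from_trace(trace)
print(api.book("2026-06-02T09:00"))
print(api.list_slots())
```

Because the mock records bookings in memory, a scoring function can still check which slot the agent chose without any real calendar entry being created.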

A team is using an LLM-as-judge to score their RAG pipeline's answer quality and has full trust in those judge scores without any validation.

This violates the 'eval the eval' principle. The team must build a ground truth dataset of agent outputs that have been manually labelled by a human expert. They then run their LLM-as-judge against that dataset and measure alignment. Because LLM-as-judge outputs are discrete (e.g. pass/fail, 1–5 score), a human-labelled ground truth can be created and the judge's accuracy measured directly. Only once the judge demonstrates acceptable alignment with human decisions should it be trusted to score at scale.
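
Measuring alignment on discrete labels can be as simple as the sketch below. The source does not prescribe a metric; raw agreement and Cohen's kappa (a common chance-corrected agreement statistic) are used here as assumptions, and the labels are made-up sample data.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of examples where the judge's label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for the agreement expected by chance alone."""
    observed = agreement_rate(human, judge)
    n = len(human)
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:          # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for eight agent outputs.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
print(agreement_rate(human, judge))
print(cohen_kappa(human, judge))
```

What counts as "acceptable alignment" is a judgment call for the team and the domain; the numbers only make that decision explicit.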

// What mistakes should I avoid when implementing the Hetzel Eval Maturity Phases Framework?

Treating evals like unit tests — trying to exhaustively cover every possible failure scenario instead of targeting known, high-priority failure modes. You will spend all your time writing tests and none of your time shipping.
Collecting thumbs up/down without collecting the justification. The justification is what lets you later scale human knowledge into LLM-as-judge. Without it, you lose the domain-specific insight that makes the eval meaningful.
Trusting LLM-as-judge outputs without evaluating them. Putting a robe and a cloak on an LLM does not make it inherently more trustworthy — you must eval the eval against a human ground truth dataset.
Using only synthetic or hand-crafted examples in your eval dataset instead of real production or UAT traces. Evals should approximate rerunning production, not running abstract hypothetical tests.
For CRUD-based tool call agents: running evals that write to production systems, or failing to represent external system state accurately in the trace, leading to evals that do not reflect real conditions.
Evaluating only the final output of a multi-step agent instead of evaluating the full trace — individual tool calls, intermediate decisions, and all steps can each be a vector for failure.
Waiting until everything is perfect before starting evals. Vibe checking with documented human annotation is better than nothing and is a legitimate starting point.

// What are the key terms and definitions in the Hetzel Eval Maturity Phases Framework?

Evals: Evaluation runs performed on an agent or prompt to gain confidence in its quality before and during production. Distinct from unit tests — they target known failure modes rather than exhaustive coverage.
Agent quality: The north star goal of the entire eval system — ensuring an agent does what you expect when confronted with real usage and real users, managing reputational, systems cost, and compliance risks.
Failure modes: The specific, identified ways in which an agent can produce bad outputs. Evals are built around failure modes, not around exhaustive hypothetical scenarios.
Task: The agent or prompt under test — the thing being evaluated in an eval run.
Dataset: The collection of example inputs used to initiate the task during an eval. Ideally populated with production or UAT-level traces.
Scoring functions: The functions used to judge the utility or quality of a task's output. Can be deterministic (code-based) or non-deterministic (LLM-as-judge).
LLM-as-judge: A technique where a separate LLM is used to score or evaluate the outputs of the agent under test. Must itself be evaluated against human ground truth to verify alignment.
Eval the eval: The practice of validating your LLM-as-judge scoring functions by running them against a human-labelled ground truth dataset. Required because LLM judges are not inherently trustworthy without validation.
The Flywheel: The continuous improvement loop: capture production traces → identify failures (human or automated) → pull examples into offline experimentation → rerun as evals → use results to improve the agent → repeat.
Vibe checking: The earliest-stage eval practice: informally reviewing agent outputs without structured scoring. Considered a legitimate starting point as long as it is paired with documented human annotation.
Human annotation: The practice of having a human — ideally a subject matter expert — review agent outputs and record both a verdict (thumbs up/down) and a written justification. The justification is the critical output.
Context-gathering tools: Tool calls made by an agent that read data and inject it into the LLM context without modifying external systems. Lower risk in eval environments.
CRUD-based tools: Tool calls that create, read, update, or delete data in external systems. High risk in eval environments because they can corrupt production data if not handled with mocks or state isolation.
Trace: The full recorded execution of an agent run, capturing every step, tool call, input, and output. Can be arbitrarily large for complex agents. The primary unit of observability and eval input.
Maturity phases: The four-stage continuum through which eval practices evolve: (1) Just Getting Started, (2) Measuring to Manage, (3) Accounting for Complexity, (4) Advanced Eval Techniques.
Playing offense with evals: Using evals not just defensively (catching regressions) but proactively — using eval results to guide each incremental improvement to the agent and measure the impact of every change.

// FREQUENTLY ASKED QUESTIONS

What is the Hetzel Eval Maturity Phases Framework?

The Hetzel Eval Maturity Phases Framework is a four-stage methodology for evolving your AI agent evaluation system from ad-hoc vibe checking to automated production flywheels. Created by Phil Hetzel at Braintrust, it provides concrete steps at each maturity level: (1) structured vibe checking with documented human annotation, (2) measuring to manage via derived failure modes and scoring functions, (3) accounting for tool-call and external-system complexity, and (4) advanced techniques like topic modelling for failure mode discovery at scale.

What are the four maturity phases of LLM evals?

The four phases are: (1) Just Getting Started — vibe checking with documented human annotation and justifications, (2) Measuring to Manage — deriving failure modes from justifications and building deterministic or LLM-as-judge scoring functions, (3) Accounting for Complexity — handling CRUD-based tool calls, external system state, and full trace-level evaluation, and (4) Advanced Techniques — using topic modelling to surface emerging failure modes and running fully automated eval pipelines via CLI.

How do I start evaluating my LLM agent if I have no eval system yet?

Start with structured vibe checking at Level 1. Give your agent 10–20 representative inputs and have a subject matter expert — not just the builder — review each output. For every output, record a thumbs up or thumbs down plus a written justification explaining why. The justification is more important than the verdict because it externalizes domain knowledge you'll need to build automated scoring functions later. This is a legitimate starting point and far better than no evals at all.

How do you build an LLM-as-judge scoring function?

First, collect written justifications from human annotators reviewing your agent's outputs. Feed those justifications into a coding assistant to extract and categorize failure modes. For each subjective failure mode, write an LLM-as-judge prompt using the justification language as the basis. Then, critically, validate your judge by running it against a human-labelled ground truth dataset and measuring alignment. Only trust the judge at scale once it demonstrates acceptable agreement with human expert decisions.
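
The prompt-building part of that answer might look like the sketch below. It deliberately stops short of calling a model: how the prompt is sent to an LLM depends on your provider, so only the hypothetical `build_judge_prompt` and `parse_verdict` helpers are shown, and the prompt wording is an assumption rather than a recommended template.

```python
def build_judge_prompt(failure_mode: str, justifications: list[str], agent_output: str) -> str:
    """Assemble a judge prompt that reuses the annotators' own justification language."""
    examples = "\n".join(f"- {j}" for j in justifications)
    return (
        "You are reviewing the output of an AI agent.\n"
        f"Failure mode to check: {failure_mode}\n"
        "Human reviewers described this failure as follows:\n"
        f"{examples}\n\n"
        "Agent output:\n"
        f"{agent_output}\n\n"
        "Reply with PASS or FAIL on the first line, then a one-sentence reason."
    )


def parse_verdict(judge_reply: str) -> bool:
    """Map the judge's first line to a boolean; raise if the reply is malformed."""
    lines = judge_reply.strip().splitlines()
    first = lines[0].strip().upper() if lines else ""
    if first not in {"PASS", "FAIL"}:
        raise ValueError(f"unexpected judge reply: {judge_reply!r}")
    return first == "PASS"


prompt = build_judge_prompt(
    failure_mode="Hallucinated product feature",
    justifications=["Agent hallucinated a feature that does not exist"],
    agent_output="Yes, SSO is included in Pro.",
)
print(prompt)
print(parse_verdict("FAIL\nThe output claims a feature the product does not have."))
```

Constraining the reply to PASS or FAIL keeps the judge's output discrete, which is what makes it straightforward to compare against human labels afterwards.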

How does the Hetzel Eval Maturity Framework compare to just writing unit tests for my AI agent?

Unlike unit tests, which aim for exhaustive coverage of every possible case, the Hetzel framework targets known, high-priority failure modes identified by subject matter experts. Trying to write unit tests for every possible LLM failure path is infinite and unproductive. The framework also embraces non-determinism — eval results don't need to be perfect, just directional. Additionally, eval datasets should come from real production traces rather than synthetic test cases, making them closer to replaying production than running abstract tests.

When should I use the Hetzel Eval Maturity Phases Framework?

Use it whenever you're building or improving an evaluation system for an AI agent or LLM-powered application. It's particularly valuable when you're stuck in proof-of-concept and can't bridge to production, when your current eval approach is ad-hoc and needs structure, when you need to scale beyond manual review, or when your agent has external system dependencies like APIs and databases that complicate testing. It applies whether you're at zero evals or already have a mature system that needs the production flywheel.

What results can I expect from implementing the Hetzel eval maturity framework?

You can expect measurable, defensible agent quality that gives you confidence to ship to production. At each maturity level you gain: documented failure modes instead of gut feelings, reproducible scoring instead of ad-hoc reviews, safe evaluation of agents with external dependencies, and a continuous improvement flywheel that uses production traces to proactively guide agent improvements. Teams report moving from 'we think it works' to 'we can prove it works' with quantifiable quality metrics that satisfy compliance, reputational, and cost risk requirements.

What does 'eval the eval' mean in LLM evaluation?

'Eval the eval' means validating your LLM-as-judge scoring functions by running them against a human-labelled ground truth dataset. Just because you prompt an LLM to act as a judge doesn't make it inherently trustworthy. You must create a dataset of agent outputs manually labelled by a human expert, run your LLM judge against that same dataset, and measure whether the judge's scores align with human decisions. Only once alignment is acceptable should you trust the judge to score at scale.

What is the eval flywheel for AI agents?

The eval flywheel is a continuous improvement loop: capture agent traces in production, identify failures via human review or automated tooling, pull failing examples into your offline eval dataset, rerun evals, and use the results to guide your next agent improvement. This shifts evals from a defensive posture — catching regressions — to playing offense, where every production failure becomes fuel for measurable improvement. It's the mechanism that keeps your agent quality improving continuously after launch.

How do I evaluate an AI agent that makes API calls or writes to external systems?

Distinguish between context-gathering tools (read-only) and CRUD-based tools (create/update/delete). For CRUD tools, never write to production during eval runs. Instead: set up mock APIs that accept calls without writing real data, embed the external system state into your captured trace payload, and use timestamp or version queries to replay the state that existed during the original trace. This lets you evaluate realistically without corrupting production data or getting inaccurate results from changed system state.
