Hetzel Eval Maturity Phases Framework
Apply a structured, stage-by-stage methodology to design and mature your LLM/agent evaluation system — from first vibes to production flywheels — so your agent reaches production with measurable, defensible quality.
// TL;DR
The Hetzel Eval Maturity Phases Framework is a four-stage methodology for designing and maturing your LLM or AI agent evaluation system — from informal vibe checking through production flywheels. Use it whenever you're building, scaling, or improving evals for an AI-powered application. It's especially valuable when you're stuck in proof-of-concept and can't bridge to production, or when your current eval approach is ad-hoc. The framework guides you to target known failure modes (not exhaustive test coverage), extract domain knowledge from human annotators, validate LLM-as-judge scorers against ground truth, and close the loop between production traces and offline experimentation.
// When should I use the Hetzel Eval Maturity Phases Framework?
Use this skill whenever you are building or improving an evaluation (eval) system for an AI agent or LLM-powered application. Particularly valuable when you are stuck in proof-of-concept and cannot bridge to production, or when your current eval approach is ad-hoc and you need to scale or add structure.
// What inputs do I need before applying the Hetzel Eval Maturity Framework?
- Agent or prompt under test (required)
A description of the agent, prompt, or workflow you want to evaluate: what it does, its domain, and its intended users.
- Known or suspected failure modes
Any failure modes already identified by you or a subject matter expert: wrong answers, unsafe outputs, excessive cost, compliance risks, etc.
- Current maturity level (required)
Where you are today: vibe checking only, running human annotation, using LLM-as-judge, or dealing with multi-tool/external-system complexity.
- Available data
Whether you have production traces, UAT traces, or only synthetic/hand-crafted examples to draw from for your dataset.
- External system dependencies
Any CRUD-based or context-gathering tool calls, APIs, vector databases, or external systems the agent interacts with.
// What are the core principles behind the Hetzel Eval Maturity Framework?
Evals are not unit tests
Do not attempt to exhaustively cover every possible failure path — that is infinite and unproductive. Instead, start high-level with the known failure modes of your agent, identified by you or a subject matter expert, and build evals specifically around those.
Eval results don't need to be perfect
Especially when using LLM-as-judge techniques, 100% accuracy is not the goal. Results can be directional. As long as scores are trending in the right direction as you iterate, that is completely fine.
Think about evals as rerunning production
The gold standard for an eval dataset is real production traces or at least UAT-level traces. The goal is not to run abstract tests — it is to replay what actually happens in production so you can remain confident in your agent.
The Flywheel
Capture agent traces in production, identify what is going wrong (via human or automated tooling), bring those examples back into an offline experimentation environment, rerun production through an eval, and use the results to guide the next improvement. This is playing offense with your evals.
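One turn of the flywheel can be sketched roughly as follows. This is a minimal illustration, assuming traces are plain dictionaries with hypothetical `id`, `input`, and `failed` fields; the `EvalDataset` class, the stand-in task, and the stand-in scorer are invented for the example and do not come from the framework or from any particular eval platform.

```python
from dataclasses import dataclass, field


@dataclass
class EvalDataset:
    """Offline dataset keyed by trace id so repeated pulls never add duplicates."""
    examples: dict = field(default_factory=dict)

    def add_failures(self, traces):
        """Pull flagged production traces into the dataset; return how many were new."""
        added = 0
        for trace in traces:
            if trace["failed"] and trace["id"] not in self.examples:
                self.examples[trace["id"]] = {"input": trace["input"]}
                added += 1
        return added


def run_eval(dataset, task, scorer):
    """Rerun every stored input through the task and average the scores."""
    scores = [scorer(task(ex["input"])) for ex in dataset.examples.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical production traces; `failed` would be set by human review or tooling.
production_traces = [
    {"id": "t1", "input": "reset my password", "failed": False},
    {"id": "t2", "input": "cancel my plan", "failed": True},
    {"id": "t3", "input": "export my data", "failed": True},
]

dataset = EvalDataset()
new_examples = dataset.add_failures(production_traces)

# Stand-in task and scorer: the "agent" upper-cases text, the scorer looks for "PLAN".
baseline = run_eval(
    dataset,
    task=str.upper,
    scorer=lambda out: 1.0 if "PLAN" in out else 0.0,
)
print(new_examples, baseline)
```

The baseline score is the number to beat after the next agent change: rerunning `run_eval` on the same dataset with the modified task shows whether the change moved quality in the right direction.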
Extract domain knowledge before you automate
Human annotators hold critical domain-specific knowledge about what quality looks like. You must extract that knowledge — through justifications, not just thumbs up/down — before you can scale it through LLM-as-judge or automated scoring.
Eval the eval
Just because you put a robe and a cloak on an LLM does not make it inherently more trustworthy as a judge. LLM-as-judge outputs must themselves be evaluated against a ground truth dataset to confirm they align with what a human expert would decide.
Evals serve agent quality
Everything in the eval system exists wholly in service to agent quality — managing reputational risk, systems cost risk, compliance risk, and enabling confident iteration. Keep this as the north star when deciding what to measure.
// How do you apply the Hetzel Eval Maturity Framework step by step?
- 1
Identify your current maturity level and failure modes
Before designing any eval, locate yourself on the maturity continuum: (1) Just Getting Started / Vibe Checking, (2) Measuring to Manage, (3) Accounting for Complexity, (4) Advanced Techniques. Then list the agent's known failure modes with a subject matter expert — these are your primary eval targets, not an exhaustive test list.
- 2
Run structured vibe checking with documented human annotation
If at Level 1: give the agent 10+ example inputs and have a human (ideally a subject matter expert, not just the builder) review outputs. For each output, record: (a) a thumbs up or thumbs down, AND (b) a written justification explaining why. The justification is more important than the verdict — it externalises domain knowledge you will need later. Do not skip the justification step.
- 3
Derive failure modes from annotation justifications
At Level 2: feed the collected justifications into a coding assistant (e.g. Cursor, Claude Code, Codex) to systematically extract and categorise the failure modes embedded in those thumbs-down justifications. The output is a structured list of failure modes your agent actually exhibits.
- 4
Build scoring functions — deterministic and/or LLM-as-judge
For each failure mode, decide: can this be caught deterministically (e.g. too many tool calls, excessive token usage, format errors)? If yes, write a code-based scoring function. If the failure mode is subjective or nuanced, implement an LLM-as-judge scoring function using the justification language from Step 2 as the basis for the judge prompt. Critically: you must then eval your LLM-as-judge outputs against a human ground truth dataset — do not trust them blindly.
- 5
Build and populate your eval dataset from production or UAT traces
Stop using only hand-crafted synthetic examples. Capture real production traces or UAT-level traces and add them to your eval dataset. The dataset should contain the inputs that initiate the task plus enough context to represent the system state at the time the trace was created. Think of this dataset as a snapshot of production, not a test suite.
- 6
Activate the Flywheel — close the loop between production and offline experimentation
Set up a continuous process: (1) capture agent traces in production, (2) surface failures via human review or automated tooling, (3) pull failing examples into your offline eval dataset, (4) rerun evals, (5) use results to guide your next agent improvement. This is the shift from defensive evals to playing offense.
- 7
Account for external system complexity — tool calls and state representation
At Level 3, if your agent makes tool calls: distinguish context-gathering tools (read-only data injection) from CRUD-based tools (create/read/update/delete on external systems). For CRUD tools: avoid writing to production systems during eval runs. Represent external system state inside the trace itself by embedding system state context into the trace payload. Use timestamp/version queries on systems that support them (e.g. vector databases) to replay the state that existed when the original trace was captured. Use mock APIs to approximate real production environments where full state replay is not possible.
- 8
Evaluate at the full trace level, not just the final output
For multi-step agents, you cannot only score the final response. You must instrument and capture the entire trace — every tool call, every intermediate step — and target scoring functions at individual steps, including individual tool or MCP calls. You need platform tooling capable of ingesting and querying arbitrarily large traces.
- 9
Scale failure mode discovery with topic modelling at production volume
At Level 4 (Advanced): rather than relying solely on manual failure mode identification, run topic modelling across production traces at scale to automatically surface emerging failure modes you did not anticipate. Combine with automated eval pipeline execution via CLI tooling to run evals in a fully automated, continuous manner.
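The deterministic scoring functions from Step 4 and the trace-level evaluation from Step 8 might look something like the sketch below. It assumes a trace is a list of step dictionaries with hypothetical `type`, `name`, and `tokens` fields, and the budgets (3 tool calls, 2000 tokens, 400 tokens per call) are arbitrary illustrative values rather than recommendations from the framework.

```python
from typing import Callable

# A trace here is a list of step dictionaries. The field names ("type", "name",
# "tokens") are illustrative assumptions, not a real platform schema.
Trace = list[dict]


def tool_call_budget(trace: Trace, max_calls: int = 3) -> float:
    """Score 1.0 when the agent stayed within its tool-call budget, else 0.0."""
    calls = sum(1 for step in trace if step["type"] == "tool_call")
    return 1.0 if calls <= max_calls else 0.0


def token_budget(trace: Trace, max_tokens: int = 2000) -> float:
    """Score 1.0 when total token usage across all steps is within budget."""
    total = sum(step.get("tokens", 0) for step in trace)
    return 1.0 if total <= max_tokens else 0.0


def score_trace(trace: Trace, scorers: dict[str, Callable[[Trace], float]]) -> dict[str, float]:
    """Apply every trace-level scorer to the full trace, not just the final output."""
    return {name: scorer(trace) for name, scorer in scorers.items()}


def score_tool_calls(trace: Trace, step_scorer: Callable[[dict], float]) -> dict[int, float]:
    """Score each tool-call step individually, keyed by its position in the trace."""
    return {
        i: step_scorer(step)
        for i, step in enumerate(trace)
        if step["type"] == "tool_call"
    }


example_trace = [
    {"type": "tool_call", "name": "search_docs", "tokens": 300},
    {"type": "tool_call", "name": "search_docs", "tokens": 300},
    {"type": "tool_call", "name": "get_account", "tokens": 300},
    {"type": "tool_call", "name": "search_docs", "tokens": 300},
    {"type": "final_output", "tokens": 500},
]

results = score_trace(
    example_trace,
    {"tool_call_budget": tool_call_budget, "token_budget": token_budget},
)
per_step = score_tool_calls(example_trace, lambda s: 1.0 if s["tokens"] <= 400 else 0.0)
print(results)
print(per_step)
```

Here the trace passes the token budget but fails the tool-call budget, a failure that would be invisible if only the final output were scored.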
// What does the Hetzel Eval Maturity Framework look like in practice?
A team has built a customer support agent for a SaaS product. They have been manually reviewing outputs informally but have no structured eval process. They want to know if it is ready for production.
They are at Level 1. The first action is to select 10–20 representative customer queries and run them through the agent. A customer support subject matter expert — not the engineer — reviews each output and records a thumbs up/down plus a written justification (e.g. 'Thumbs down — agent hallucinated a feature that does not exist' or 'Thumbs up — correctly escalated to human agent'). These justifications become the raw material for deriving failure modes in the next step, which then get encoded as LLM-as-judge scoring functions to scale evaluation beyond what the SME can manually review.
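The annotation records in this scenario could be captured with a structure like the one below. It is a sketch only: the `Annotation` class and its field names are hypothetical, chosen to make the "verdict plus justification" rule explicit by rejecting any record that lacks a written justification.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One human review: a verdict plus the written justification."""
    input_text: str
    output_text: str
    thumbs_up: bool
    justification: str

    def __post_init__(self):
        # Enforce the rule that a verdict without a justification is incomplete.
        if not self.justification.strip():
            raise ValueError("justification is required")


def thumbs_down_justifications(annotations):
    """Collect justifications from failed reviews as raw material for failure modes."""
    return [a.justification for a in annotations if not a.thumbs_up]


# Hypothetical reviews recorded by a customer support subject matter expert.
annotations = [
    Annotation(
        input_text="Can I export reports to PDF?",
        output_text="Yes, use the one-click PDF exporter.",
        thumbs_up=False,
        justification="Agent hallucinated a feature that does not exist",
    ),
    Annotation(
        input_text="I was double charged and want a refund.",
        output_text="I am handing you over to a billing specialist.",
        thumbs_up=True,
        justification="Correctly escalated to human agent",
    ),
]

failures = thumbs_down_justifications(annotations)
print(failures)
```

The list returned by `thumbs_down_justifications` is what gets categorised into failure modes and later reused as wording for judge prompts.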
A team has an agent that books calendar appointments by calling an external scheduling API. Their eval runs keep accidentally creating real calendar entries in production.
This is a Level 3 CRUD-based tool call problem. The team should: (1) set up a mock scheduling API that accepts calls without writing real data, (2) embed the relevant external system state (e.g. which calendar slots were available at the time of the original trace) directly into the captured trace payload, and (3) configure their eval task to read that embedded state rather than query the live system. If the scheduling API supports timestamp or version queries, use them to replay the slot availability as it existed when the original production trace was captured.
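A rough sketch of the mock-plus-embedded-state approach is shown below. Everything in it is hypothetical: `MockSchedulingAPI`, the `system_state` key in the trace payload, and the toy agent are illustrative names, and a real scheduling API would expose a different interface.

```python
class MockSchedulingAPI:
    """Stand-in for the real scheduling API; it never touches a live calendar."""

    def __init__(self, available_slots):
        # Copy the slots captured in the trace so each eval run starts clean.
        self.available_slots = set(available_slots)
        self.bookings = []

    def list_slots(self):
        return sorted(self.available_slots)

    def book(self, slot):
        """Record the booking in memory only and report success or failure."""
        if slot not in self.available_slots:
            return {"ok": False, "error": "slot unavailable"}
        self.available_slots.remove(slot)
        self.bookings.append(slot)
        return {"ok": True, "slot": slot}


def run_eval_case(trace, agent):
    """Rebuild system state from the trace payload, then run the agent against the mock."""
    api = MockSchedulingAPI(trace["system_state"]["available_slots"])
    return agent(trace["input"], api), api


# Hypothetical captured trace with the external state embedded in its payload.
captured_trace = {
    "input": "Book me the earliest slot",
    "system_state": {"available_slots": ["2024-05-02T10:00", "2024-05-01T09:00"]},
}


def earliest_slot_agent(request, api):
    """Toy agent used only to exercise the mock."""
    return api.book(api.list_slots()[0])


result, api = run_eval_case(captured_trace, earliest_slot_agent)
print(result, api.bookings)
```

Because the mock copies the slots out of the trace, the captured trace is left unchanged and the same case can be replayed as many times as needed.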
A team is using an LLM-as-judge to score their RAG pipeline's answer quality and has full trust in those judge scores without any validation.
This violates the 'eval the eval' principle. The team must build a ground truth dataset of agent outputs that have been manually labelled by a human expert. They then run their LLM-as-judge against that dataset and measure alignment. Because LLM-as-judge outputs are discrete (e.g. pass/fail, 1–5 score), a human-labelled ground truth can be created and the judge's accuracy measured directly. Only once the judge demonstrates acceptable alignment with human decisions should it be trusted to score at scale.
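Measuring alignment between the judge and the human expert could be sketched as below, assuming both produce discrete labels for the same outputs. Raw agreement is the simplest measure; Cohen's kappa is included as one common way to correct for chance agreement, which is an addition here and not something the framework prescribes. The labels are made-up illustrative data, and what counts as "acceptable" alignment is left to the team.

```python
from collections import Counter


def agreement_rate(human, judge):
    """Fraction of examples where the judge matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Agreement corrected for chance; 1.0 is perfect, 0.0 is chance level."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only: human ground truth vs. LLM-as-judge verdicts.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "fail"]

accuracy = agreement_rate(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(accuracy, kappa)
```

Looking at the individual disagreements, not just the summary numbers, usually shows which part of the judge prompt needs to borrow more of the expert's justification language.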
// What mistakes should I avoid when implementing the Hetzel Eval Maturity Framework?
- Treating evals like unit tests — trying to exhaustively cover every possible failure scenario instead of targeting known, high-priority failure modes. You will spend all your time writing tests and none of your time shipping.
- Collecting thumbs up/down without collecting the justification. The justification is what lets you later scale human knowledge into LLM-as-judge. Without it, you lose the domain-specific insight that makes the eval meaningful.
- Trusting LLM-as-judge outputs without evaluating them. Putting a robe and a cloak on an LLM does not make it inherently more trustworthy — you must eval the eval against a human ground truth dataset.
- Using only synthetic or hand-crafted examples in your eval dataset instead of real production or UAT traces. Evals should approximate rerunning production, not running abstract hypothetical tests.
- For CRUD-based tool call agents: running evals that write to production systems, or failing to represent external system state accurately in the trace, leading to evals that do not reflect real conditions.
- Evaluating only the final output of a multi-step agent instead of evaluating the full trace — individual tool calls, intermediate decisions, and all steps can each be a vector for failure.
- Waiting until everything is perfect before starting evals. Vibe checking with documented human annotation is better than nothing and is a legitimate starting point.
// What are the key terms used in the Hetzel Eval Maturity Framework?
- Evals
- Evaluation runs performed on an agent or prompt to gain confidence in its quality before and during production. Distinct from unit tests — they target known failure modes rather than exhaustive coverage.
- Agent quality
- The north star goal of the entire eval system — ensuring an agent does what you expect when confronted with real usage and real users, managing reputational, systems cost, and compliance risks.
- Failure modes
- The specific, identified ways in which an agent can produce bad outputs. Evals are built around failure modes, not around exhaustive hypothetical scenarios.
- Task
- The agent or prompt under test — the thing being evaluated in an eval run.
- Dataset
- The collection of example inputs used to initiate the task during an eval. Ideally populated with production or UAT-level traces.
- Scoring functions
- The functions used to judge the utility or quality of a task's output. Can be deterministic (code-based) or non-deterministic (LLM-as-judge).
- LLM-as-judge
- A technique where a separate LLM is used to score or evaluate the outputs of the agent under test. Must itself be evaluated against human ground truth to verify alignment.
- Eval the eval
- The practice of validating your LLM-as-judge scoring functions by running them against a human-labelled ground truth dataset. Required because LLM judges are not inherently trustworthy without validation.
- The Flywheel
- The continuous improvement loop: capture production traces → identify failures (human or automated) → pull examples into offline experimentation → rerun as evals → use results to improve the agent → repeat.
- Vibe checking
- The earliest-stage eval practice: informally reviewing agent outputs without structured scoring. Considered a legitimate starting point as long as it is paired with documented human annotation.
- Human annotation
- The practice of having a human — ideally a subject matter expert — review agent outputs and record both a verdict (thumbs up/down) and a written justification. The justification is the critical output.
- Context-gathering tools
- Tool calls made by an agent that read data and inject it into the LLM context without modifying external systems. Lower risk in eval environments.
- CRUD-based tools
- Tool calls that create, read, update, or delete data in external systems. High risk in eval environments because they can corrupt production data if not handled with mocks or state isolation.
- Trace
- The full recorded execution of an agent run, capturing every step, tool call, input, and output. Can be arbitrarily large for complex agents. The primary unit of observability and eval input.
- Maturity phases
- The four-stage continuum through which eval practices evolve: (1) Just Getting Started / Vibe Checking, (2) Measuring to Manage, (3) Accounting for Complexity, (4) Advanced Techniques.
- Playing offense with evals
- Using evals not just defensively (catching regressions) but proactively — using eval results to guide each incremental improvement to the agent and measure the impact of every change.
// FREQUENTLY ASKED QUESTIONS
What is the Hetzel Eval Maturity Phases Framework?
The Hetzel Eval Maturity Phases Framework is a four-stage methodology for building and maturing AI agent and LLM evaluation systems. The four stages are: (1) Just Getting Started / Vibe Checking, (2) Measuring to Manage, (3) Accounting for Complexity, and (4) Advanced Techniques. It was introduced by Phil Hetzel of Braintrust and emphasizes targeting known failure modes, extracting domain knowledge from human annotators before automating, validating LLM-as-judge outputs, and creating a continuous flywheel between production traces and offline experimentation.
What are the four maturity levels in the Hetzel eval framework?
The four levels are: Level 1 — Just Getting Started (structured vibe checking with documented human annotation), Level 2 — Measuring to Manage (deriving failure modes from annotations and building deterministic or LLM-as-judge scoring functions), Level 3 — Accounting for Complexity (handling tool calls, CRUD operations, external system state, and full trace evaluation), and Level 4 — Advanced Techniques (topic modelling across production traces to surface emerging failure modes and fully automated eval pipelines).
How do I start evaluating my LLM agent if I have no eval system?
Start with structured vibe checking at Level 1. Give your agent 10–20 representative inputs and have a subject matter expert — not the engineer who built it — review each output. For every output, record a thumbs up or thumbs down plus a written justification explaining the verdict. The justification is more important than the score because it externalizes domain knowledge you'll need to build automated scoring functions later. This is a legitimate starting point; don't wait for a perfect system before beginning.
How do I build an LLM-as-judge scoring function?
First, collect human annotation justifications that describe why specific agent outputs failed. Use those justification descriptions as the basis for your judge prompt — they contain the domain-specific language and criteria that define quality. Then critically, you must eval the eval: build a ground truth dataset of human-labelled outputs and measure how well your LLM-as-judge aligns with human decisions. Only trust the judge at scale once it demonstrates acceptable agreement with expert reviewers.
How does the Hetzel Eval Maturity Framework compare to just writing unit tests for my AI agent?
Evals are fundamentally different from unit tests. Unit tests aim for exhaustive coverage of deterministic code paths, while the Hetzel framework targets known, high-priority failure modes identified by subject matter experts. Trying to exhaustively cover every possible failure scenario for an LLM agent is infinite and unproductive. Additionally, eval results can be directional rather than pass/fail — as long as scores trend in the right direction as you iterate, that's sufficient. The framework also emphasizes replaying real production traces, not abstract hypothetical test cases.
When should I use the Hetzel Eval Maturity Phases Framework?
Use it whenever you're building or improving an evaluation system for an AI agent or LLM-powered application. It's particularly valuable in three situations: when you're stuck in proof-of-concept and can't bridge to production because you lack quality confidence, when your current eval approach is ad-hoc and you need to add structure to scale, or when you're running evals but not closing the loop between production failures and offline improvement. The framework applies to any LLM application — RAG pipelines, customer support agents, coding assistants, or multi-step tool-using agents.
What is the eval flywheel in the Hetzel framework?
The flywheel is a continuous improvement loop with five steps: (1) capture agent traces in production, (2) surface failures via human review or automated tooling, (3) pull failing examples into your offline eval dataset, (4) rerun evals against the updated dataset, and (5) use the results to guide your next agent improvement. This shifts evals from defensive (catching regressions) to playing offense — proactively using eval results to drive each incremental improvement and measure the impact of every change.
What does 'eval the eval' mean?
Eval the eval means validating your LLM-as-judge scoring functions against a human-labelled ground truth dataset. Just because you use a powerful LLM as a judge doesn't make it inherently trustworthy — as Hetzel puts it, putting a robe and a cloak on an LLM doesn't make it more reliable. You create a set of agent outputs manually labelled by a human expert, run your LLM-as-judge against them, and measure alignment. Only once the judge demonstrates acceptable accuracy should you trust it to score at scale.
How do I handle tool calls and external APIs in my eval system?
At Level 3, distinguish between context-gathering tools (read-only) and CRUD-based tools (create/update/delete on external systems). For CRUD tools, never write to production during eval runs. Instead, embed external system state directly into the captured trace payload, use timestamp or version queries to replay the state that existed when the original trace was captured, and use mock APIs to approximate production environments. This prevents corrupting production data while keeping evals representative of real conditions.
What results can I expect from implementing the Hetzel Eval Maturity Framework?
You can expect measurable, defensible confidence in your agent's quality before and during production. Specifically: a structured understanding of your agent's actual failure modes, scoring functions that scale human expertise beyond what manual review can cover, an eval dataset grounded in real production behavior rather than hypothetical scenarios, and a continuous improvement loop that turns production failures into agent improvements. Teams typically progress from subjective gut-feel quality assessments to quantifiable quality metrics that support confident iteration and stakeholder communication.
Why is the justification more important than thumbs up or thumbs down in eval annotation?
The justification captures the domain-specific reasoning behind a quality judgment — it externalizes tacit expert knowledge into written form. A thumbs down tells you something failed; the justification tells you why, which is what you need to build targeted scoring functions. These justifications become the raw material for deriving failure mode categories and for writing LLM-as-judge prompts that encode domain expertise. Without justifications, you lose the insight that makes your evals meaningful and cannot scale human knowledge into automated scoring.