Frequently Asked Questions About the Hetzel Eval Maturity Phases Framework
21 answers covering everything from basics to advanced usage.
// Basics
How many examples do I need in my eval dataset to get started?
Start with 10–20 representative examples at Level 1. The goal at this stage is structured vibe checking, not statistical significance. As you progress to Level 2 and beyond, expand your dataset with real production traces or UAT-level traces. The Hetzel framework emphasizes that the quality and representativeness of your examples matter more than volume: a small dataset of real production traces is more valuable than a large dataset of synthetic hypotheticals. Scale the dataset as your eval maturity grows.
What's the difference between context-gathering tools and CRUD-based tools in evals?
Context-gathering tools are read-only — they inject data into the LLM context without modifying external systems (e.g., querying a knowledge base). CRUD-based tools create, read, update, or delete data in external systems (e.g., booking a calendar appointment, updating a database record). The distinction matters because CRUD tools pose a risk of corrupting production data during eval runs. The framework requires mock APIs or state isolation for CRUD tools, while context-gathering tools are lower risk and can often be replayed more directly.
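As a rough illustration of state isolation for a CRUD tool, the sketch below swaps a calendar service for an in-memory mock. The class name `MockCalendarAPI`, its methods, and the event shape are hypothetical and not part of the framework; the point is only that each eval run works on its own copy of the data.

```python
from copy import deepcopy


class MockCalendarAPI:
    """In-memory stand-in for a hypothetical calendar service (a CRUD tool)."""

    def __init__(self, seed_events=None):
        # Deep-copy the seed so every eval run starts from isolated state.
        self._events = deepcopy(seed_events) if seed_events else {}
        self.calls = []  # record each call so scoring functions can inspect it

    def create_event(self, event_id, title):
        self.calls.append(("create_event", event_id))
        self._events[event_id] = {"title": title}
        return {"ok": True, "id": event_id}

    def get_event(self, event_id):
        self.calls.append(("get_event", event_id))
        return self._events.get(event_id)

    def delete_event(self, event_id):
        self.calls.append(("delete_event", event_id))
        return {"ok": self._events.pop(event_id, None) is not None}


seed = {"evt-1": {"title": "Standup"}}
api = MockCalendarAPI(seed)
api.create_event("evt-2", "Design review")
api.delete_event("evt-1")
# `seed` is untouched, so the next eval run can reuse it unchanged.
```

Because the mock records its calls, the same object can also feed a scoring function that checks how many write operations the agent attempted.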
Should I use synthetic data or production data for my eval dataset?
Production data is the gold standard. The Hetzel framework's core principle is that evals should approximate rerunning production, not running abstract hypothetical tests. Real production traces or at least UAT-level traces capture the actual input patterns, edge cases, and system states your agent encounters. Synthetic or hand-crafted examples are acceptable as a starting point at Level 1, but you should actively work toward replacing them with real traces. Using only synthetic data is listed as a specific pitfall to avoid.
What does 'playing offense with evals' mean?
Playing offense means using evals proactively to guide agent improvement rather than just defensively catching regressions. In the flywheel model, you capture production traces, identify failures, bring them into your offline eval dataset, rerun evals, and use the results to decide what to improve next. Every change you make to the agent gets measured through evals so you can quantify its impact. This contrasts with defensive evals where you only check whether existing functionality still works after a change.
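One step of the flywheel, pulling failing production traces into the offline dataset, might look like the sketch below. The trace keys (`id`, `input`, `score`) and the 0.5 threshold are assumptions made for illustration only.

```python
def add_failures_to_dataset(dataset, production_traces, threshold=0.5):
    """Append production traces that scored below `threshold` to the dataset.

    Traces are dicts with hypothetical keys "id", "input", and "score".
    Returns the ids that were newly added.
    """
    known_ids = {example["id"] for example in dataset}
    added = []
    for trace in production_traces:
        if trace["score"] < threshold and trace["id"] not in known_ids:
            dataset.append({"id": trace["id"], "input": trace["input"]})
            known_ids.add(trace["id"])
            added.append(trace["id"])
    return added
```

Skipping ids already in the dataset keeps repeated flywheel passes from duplicating the same failure.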
How do I convince stakeholders that my AI agent is production-ready using this framework?
The framework produces concrete, quantifiable artifacts that support stakeholder conversations: a documented list of known failure modes, scoring functions with measurable results, LLM-as-judge outputs validated against human ground truth, and eval datasets built from realistic traces. You can show that specific failure modes are scored, that scores trend positively over iterations, and that the eval system itself has been validated. This moves the conversation from 'it seems to work fine' to 'we have measurable, defensible quality evidence across these specific risk dimensions.'
// How To
How do I evaluate a multi-step agent that makes many tool calls?
Scoring only the final output is not enough. The Hetzel framework requires instrumenting and capturing the entire trace: every tool call, every intermediate step, every decision point. Then target scoring functions at individual steps, including individual tool or MCP calls, not just the end result. Each intermediate step can be a vector for failure. You need platform tooling capable of ingesting and querying arbitrarily large traces. This is a Level 3 capability that becomes essential as agent complexity grows.
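A minimal sketch of step-level scoring follows. It assumes a trace is a list of step dicts with hypothetical keys `type`, `name`, and `output`; real platforms will have their own trace schema, and the tool name `search_kb` is invented for the example.

```python
def score_trace(trace, step_scorers, max_tool_calls=5):
    """Score individual steps of a trace, plus one trace-level check.

    `step_scorers` maps a step name to a function(step) -> float in [0, 1].
    """
    results = []
    for index, step in enumerate(trace):
        scorer = step_scorers.get(step.get("name"))
        if scorer is not None:
            results.append(
                {"step": index, "name": step["name"], "score": scorer(step)}
            )
    # Trace-level deterministic check: did the agent stay within its call budget?
    tool_calls = sum(1 for step in trace if step["type"] == "tool_call")
    results.append(
        {
            "step": None,
            "name": "tool_call_budget",
            "score": 1.0 if tool_calls <= max_tool_calls else 0.0,
        }
    )
    return results


def search_returned_results(step):
    # Hypothetical step scorer: an empty search result counts as a failure.
    return 1.0 if step["output"] else 0.0


trace = [
    {"type": "tool_call", "name": "search_kb", "output": []},
    {"type": "tool_call", "name": "search_kb", "output": ["doc-7"]},
    {"type": "final", "name": "answer", "output": "See doc-7."},
]
scores = score_trace(trace, {"search_kb": search_returned_results})
```

Here the first search step scores 0.0 even though the final answer looks fine, which is the kind of intermediate failure that final-output scoring would miss.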
How do I go from Level 1 vibe checking to Level 2 measuring to manage?
The transition requires three actions. First, take the written justifications from your Level 1 human annotations and systematically extract failure mode categories — use a coding assistant to help categorize them. Second, for each failure mode, build a scoring function: deterministic (code-based) for objective failures like format errors or excessive tool calls, and LLM-as-judge for subjective failures. Third, build a ground truth dataset and validate your LLM-as-judge against it. You now have measurable, repeatable scoring instead of informal review.
How do I identify failure modes if I'm just getting started?
Work with a subject matter expert — someone who understands the domain and the end users, ideally not the engineer who built the agent. Run 10–20 representative inputs through the agent and have the SME review each output with a thumbs up/down and a written justification. The justifications from thumbs-down verdicts naturally surface failure modes. Then feed these justifications into a coding assistant to systematically extract and categorize the failure patterns. This is more productive than trying to brainstorm every possible failure scenario in advance.
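One possible shape for the annotation records is sketched below. The `Annotation` fields and the helper name are assumptions, not something the framework prescribes; the helper enforces the written-justification rule and collects the thumbs-down notes that would then be handed to a coding assistant for categorization.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    example_id: str
    thumbs_up: bool
    justification: str


def collect_failure_notes(annotations):
    """Return (example_id, justification) pairs for thumbs-down verdicts.

    Raises ValueError if any annotation lacks a written justification.
    """
    missing = [a.example_id for a in annotations if not a.justification.strip()]
    if missing:
        raise ValueError(f"annotations missing a justification: {missing}")
    return [
        (a.example_id, a.justification.strip())
        for a in annotations
        if not a.thumbs_up
    ]
```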
Can I use the Hetzel eval framework for a RAG pipeline?
Yes, it applies directly. A RAG pipeline has clearly identifiable failure modes (retrieval of wrong documents, hallucination beyond retrieved context, incomplete answers). Start at Level 1 with an SME reviewing RAG outputs and justifying verdicts. At Level 2, build scoring functions — deterministic ones for retrieval precision and LLM-as-judge for answer quality. At Level 3, handle the vector database as an external system dependency, embedding retrieval state into traces. The flywheel then brings real user queries from production into your eval dataset for continuous improvement.
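A deterministic retrieval-precision scorer can be as small as the sketch below. It assumes each eval example carries a hand-labelled list of relevant document ids, which is an assumption about your dataset rather than a framework requirement.

```python
def retrieval_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved documents that are labelled relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(retrieved_ids)
```

For example, `retrieval_precision(["a", "b", "c", "d"], ["a", "c"])` returns `0.5`, since two of the four retrieved documents are relevant.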
What does a scoring function look like in the Hetzel framework?
Scoring functions come in two types. Deterministic scoring functions are code-based and check objective criteria: Did the agent make more than N tool calls? Did the response exceed a token limit? Is the output in the correct format? LLM-as-judge scoring functions use a separate LLM to evaluate subjective quality: Did the response accurately address the user's intent? Was the tone appropriate? The judge prompt should incorporate language directly from human annotation justifications to encode domain expertise. Both types produce scores that can be tracked over iterations to measure improvement.
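The two types could be sketched as follows. The required JSON keys, the prompt wording, and the `call_llm` parameter are all illustrative assumptions: `call_llm` stands for any function that takes a prompt string and returns the model's text, so no particular vendor API is implied.

```python
import json


def format_score(output_text, required_keys=("answer", "sources")):
    """Deterministic scorer: 1.0 if output is a JSON object with the required keys."""
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if all(key in parsed for key in required_keys) else 0.0


def build_judge_prompt(user_input, agent_output, criteria):
    """Build a judge prompt from criteria taken from human justifications."""
    bullet_list = "\n".join(f"- {criterion}" for criterion in criteria)
    return (
        "You are reviewing an AI agent's response.\n"
        f"Fail the response if any of these apply:\n{bullet_list}\n\n"
        f"User input:\n{user_input}\n\n"
        f"Agent response:\n{agent_output}\n\n"
        "Reply with exactly PASS or FAIL."
    )


def judge_score(user_input, agent_output, criteria, call_llm):
    """LLM-as-judge scorer; `call_llm` is any function(prompt) -> str."""
    verdict = call_llm(build_judge_prompt(user_input, agent_output, criteria))
    return 1.0 if verdict.strip().upper() == "PASS" else 0.0
```

Passing the model call in as a parameter keeps the scorer testable with a stub and makes it easy to swap in a more capable judge model later.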
// Troubleshooting
Can I skip straight to LLM-as-judge without doing human annotation first?
No, this is a critical mistake the framework warns against. Human annotators hold domain-specific knowledge about what quality looks like. You must extract that knowledge through documented justifications before automating with LLM-as-judge. The justifications from human review become the raw material for your judge prompts. Skipping this step means your LLM-as-judge lacks grounded criteria and you have no human ground truth dataset to validate its outputs against. Always start with structured human annotation, even if brief.
What if my LLM-as-judge disagrees with human experts?
This is exactly why the 'eval the eval' principle exists. When your LLM-as-judge disagrees with human ground truth, diagnose the source: is the judge prompt missing domain-specific criteria? Is the judge model too weak for the task? Are the human labels inconsistent? Refine the judge prompt using the language from your human annotation justifications, try a more capable judge model, or clean up inconsistencies in human labels. Re-measure alignment after each adjustment. Don't deploy a judge that hasn't demonstrated acceptable agreement with expert reviewers.
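Measuring alignment can start with a plain agreement rate plus a list of the disagreements to review by hand. The sketch below assumes both label sets are dicts mapping an example id to a pass/fail boolean; that representation is an assumption for illustration.

```python
def judge_alignment(human_labels, judge_labels):
    """Compare judge verdicts with human ground truth.

    Both arguments map example id -> bool (True means pass).
    Returns (agreement_rate, ids_where_they_disagree).
    """
    shared = sorted(set(human_labels) & set(judge_labels))
    if not shared:
        raise ValueError("no overlapping examples to compare")
    disagreements = [i for i in shared if human_labels[i] != judge_labels[i]]
    agreement = 1 - len(disagreements) / len(shared)
    return agreement, disagreements
```

The disagreement ids are the useful part for diagnosis: reading those examples shows whether the judge prompt, the judge model, or the human labels need attention.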
What's the biggest mistake teams make when building LLM eval systems?
Treating evals like unit tests — trying to exhaustively cover every possible failure scenario. This is the first pitfall the Hetzel framework identifies. The failure space of an LLM agent is effectively infinite, so attempting exhaustive coverage means you spend all your time writing tests and none shipping. Instead, focus on known, high-priority failure modes identified by subject matter experts. A second critical mistake is trusting LLM-as-judge outputs without validation against human ground truth. Both mistakes stem from applying traditional software testing mental models to non-deterministic AI systems.
What if I don't have a subject matter expert available to review outputs?
A true domain expert is strongly recommended. If none is available, the next best option is someone who deeply understands the end users and use case, such as a product manager, customer success lead, or experienced support agent. The critical requirement is that the reviewer is not the engineer who built the agent, because builders have inherent bias toward their own outputs. Whoever reviews must provide written justifications, not just thumbs up/down. The justifications are what make later automation possible. Even imperfect human review is better than no structured annotation at all.
// Comparisons
What's the difference between evals and traditional software testing for AI agents?
Evals target known, high-priority failure modes rather than seeking exhaustive code path coverage. Traditional unit tests are deterministic and expect exact outputs; evals accept directional results, where it is sufficient for scores to trend in the right direction. Evals also incorporate subjective quality judgments via LLM-as-judge and human annotation, handle non-deterministic outputs, and ideally replay real production traces rather than synthetic test inputs. The Hetzel framework explicitly warns against treating evals like unit tests, because doing so leads to spending all your time writing tests and none shipping.
How is the Hetzel framework different from just using an eval tool like Braintrust or LangSmith?
The Hetzel framework is a methodology, not a tool — it describes what to do and in what order regardless of which platform you use. Tools like Braintrust, LangSmith, or Arize provide the infrastructure for capturing traces, running scoring functions, and managing datasets. The framework tells you how to mature your eval practice: when to start with human annotation, how to derive failure modes, when to introduce LLM-as-judge, how to validate it, and how to close the production flywheel. You need both the methodology and the tooling; the framework guides your use of whatever tool you choose.
// Advanced
How accurate does my LLM-as-judge need to be?
The Hetzel framework explicitly states that 100% accuracy is not the goal. Eval results can be directional — as long as scores are trending in the right direction as you iterate on the agent, that is sufficient. The key requirement is that your LLM-as-judge has been validated against a human ground truth dataset and shows acceptable alignment. What counts as 'acceptable' depends on your use case and risk tolerance. For high-stakes compliance or safety evaluations, you need higher alignment; for general quality trending, moderate alignment with human judgments is workable.
How do I handle external system state when replaying production traces in evals?
Embed the relevant external system state directly into the captured trace payload so it travels with the trace. For systems that support it, use timestamp or version queries to replay the state that existed when the original trace was captured (e.g., querying a vector database at a specific point in time). For systems that don't support state versioning, use mock APIs to approximate the production environment. The goal is that your eval replay reflects the same conditions the agent encountered in production, not the current state of external systems.
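The embed-state-in-the-trace approach might be sketched like this. The payload keys and the `agent_fn(user_input, state)` signature are hypothetical; the idea is only that replay reads from the snapshot stored with the trace rather than from live systems.

```python
import json
from datetime import datetime, timezone


def capture_trace(user_input, agent_output, external_state):
    """Build a trace payload that carries the external state it was run against."""
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "input": user_input,
        "output": agent_output,
        "external_state": external_state,  # must be JSON-serializable
    }


def replay(trace, agent_fn):
    """Rerun the agent against the state stored in the trace, not live systems."""
    # Round-trip through JSON so the agent cannot mutate the stored snapshot.
    snapshot = json.loads(json.dumps(trace["external_state"]))
    return agent_fn(trace["input"], snapshot)


def toy_agent(user_input, state):
    # Stand-in for a real agent: looks up stock in the provided state.
    return f"{state['inventory'][user_input]} in stock"


trace = capture_trace("widget", "3 in stock", {"inventory": {"widget": 3}})
replayed_output = replay(trace, toy_agent)
```

Even if live inventory has since changed, the replay sees the same stock level the agent saw when the trace was captured.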
What's the role of topic modelling in the Hetzel eval framework?
Topic modelling is a Level 4 (Advanced) technique used to automatically surface emerging failure modes you didn't anticipate. Instead of relying solely on manual failure mode identification, you run topic modelling across production traces at scale to discover patterns and clusters of problematic behavior. This is especially valuable as your agent handles high volumes and encounters edge cases that no human reviewer would catch through manual sampling. It complements — but doesn't replace — the human annotation process that grounds your eval system.
How often should I run evals in the flywheel model?
At Level 4 (Advanced), evals run in a fully automated, continuous manner via CLI tooling — triggered on every agent change, on a schedule, or as part of CI/CD. Before reaching that level, run evals at minimum before every agent deployment and whenever you pull new failing examples from production into your dataset. The flywheel is continuous: capture traces, identify failures, update the dataset, rerun evals, improve the agent. The cadence should match your iteration speed — if you ship daily, eval daily. The framework emphasizes that waiting for perfection is itself a pitfall.
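For the CI/CD case, a gate can reduce to comparing mean scores per failure mode with minimum thresholds and returning a non-zero exit code on failure. The function names, score layout, and threshold values below are assumptions for illustration, not a prescribed CLI.

```python
def gate(scores, thresholds):
    """Return the failure-mode names whose mean score is below its threshold.

    `scores` maps a failure-mode name to a list of per-example scores.
    `thresholds` maps the same names to the minimum acceptable mean.
    """
    failing = []
    for name, minimum in thresholds.items():
        values = scores.get(name, [])
        # A failure mode with no scores is treated as failing, not passing.
        mean = sum(values) / len(values) if values else 0.0
        if mean < minimum:
            failing.append(name)
    return failing


def exit_code(scores, thresholds):
    """Return 0 when every threshold is met, 1 otherwise.

    In a CI script, pass this value to sys.exit() to block the deployment.
    """
    return 1 if gate(scores, thresholds) else 0
```

Since results are directional, thresholds are best set from recent baseline runs rather than at 1.0, so the gate blocks regressions without demanding perfection.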
How does this framework handle compliance and safety evals?
The framework treats compliance and safety as categories of failure modes — specific, identified ways your agent can produce bad outputs. They follow the same process: identify compliance-relevant failure modes with an SME (e.g., giving medical advice, exposing PII, violating financial regulations), build targeted scoring functions for each, and validate those scoring functions against human-labelled ground truth. The framework's principle that 'evals serve agent quality' explicitly includes managing compliance risk and reputational risk alongside functional quality. Higher-stakes failure modes may warrant stricter LLM-as-judge accuracy thresholds.
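As one example of a targeted compliance scorer, the sketch below flags outputs containing an email address or a US Social Security number pattern. The two regular expressions are deliberately simple assumptions and would miss many real PII formats; a production check would need patterns agreed with your compliance SME and validated against human-labelled examples.

```python
import re

# Simplified, illustrative patterns; not a complete PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def pii_score(output_text):
    """Return 0.0 if the text matches an email or SSN-like pattern, else 1.0."""
    patterns = (EMAIL, US_SSN)
    return 0.0 if any(p.search(output_text) for p in patterns) else 1.0
```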