Question 1

What is vibe checking in LLM evaluation?

Accepted Answer

Vibe checking is the earliest-stage eval practice where you informally review agent outputs without structured scoring. In the Hetzel framework, it becomes a legitimate methodology when paired with documented human annotation — a subject matter expert records both a verdict (thumbs up/down) and a written justification for each output. The justification captures domain knowledge that becomes the raw material for building automated scoring functions at later maturity levels.
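
As a minimal sketch of what a documented annotation could look like, the record below pairs each verdict with a required justification. The `Annotation` class, its field names, and the sample data are illustrative assumptions, not part of the framework itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One SME review of a single agent output (hypothetical schema)."""
    output_id: str
    thumbs_up: bool
    justification: str

    def __post_init__(self) -> None:
        # A verdict without a written reason loses the domain knowledge.
        if not self.justification.strip():
            raise ValueError("justification must not be empty")


def failing_justifications(annotations: list[Annotation]) -> list[str]:
    """Collect thumbs-down justifications for later failure-mode extraction."""
    return [a.justification for a in annotations if not a.thumbs_up]


annotations = [
    Annotation("out-1", True, "Answer matches the pricing page."),
    Annotation("out-2", False, "Agent hallucinated a feature that doesn't exist."),
]
print(failing_justifications(annotations))
```

Rejecting empty justifications at creation time is one way to keep the annotation step from quietly degrading into bare thumbs up/down.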

Question 2

What is the difference between deterministic and LLM-as-judge scoring functions?

Accepted Answer

Deterministic scoring functions are code-based checks that catch failures with clear, rule-based criteria — like excessive token usage, too many tool calls, or format errors. LLM-as-judge scoring functions use a separate LLM to evaluate subjective or nuanced failure modes that can't be caught with simple rules. Deterministic functions are reliable but limited in scope; LLM-as-judge functions are flexible but must be validated against human ground truth to ensure they align with expert judgment before being trusted at scale.
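
A deterministic scoring function can be as small as the sketch below. The trace layout (`output`, `total_tokens`, `tool_calls`), the JSON output format, and both thresholds are assumptions made for illustration; substitute whatever your own traces record.

```python
import json

MAX_TOKENS = 2000     # hypothetical token budget
MAX_TOOL_CALLS = 5    # hypothetical tool-call limit


def score_deterministic(trace: dict) -> dict[str, bool]:
    """Return pass/fail per rule for one trace (True means the check passed)."""
    try:
        json.loads(trace["output"])
        valid_format = True
    except (json.JSONDecodeError, TypeError):
        valid_format = False
    return {
        "token_budget": trace["total_tokens"] <= MAX_TOKENS,
        "tool_call_limit": len(trace["tool_calls"]) <= MAX_TOOL_CALLS,
        "valid_json_output": valid_format,
    }


trace = {
    "output": '{"answer": "Your refund was issued."}',
    "total_tokens": 2450,
    "tool_calls": ["lookup_order", "issue_refund"],
}
result = score_deterministic(trace)
print(result)
```

Here the trace fails only the token budget check, which shows how each rule reports independently.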

Question 3

Why are justifications more important than thumbs up/down in eval annotation?

Accepted Answer

Justifications externalize domain-specific knowledge about what quality looks like. A thumbs-down alone tells you something failed but not why. The written justification — such as 'agent hallucinated a feature that doesn't exist' — captures the specific failure mode, which you can later feed into a coding assistant to systematically categorize failure patterns. These justifications also directly inform the prompt language for LLM-as-judge scoring functions. Without justifications, you lose the insight that makes scaling evaluation meaningful.

Question 4

How do I identify which maturity level I'm at in the Hetzel framework?

Accepted Answer

Assess your current eval practices honestly. Level 1: you're vibe checking — reviewing outputs informally with no structured scoring. Level 2: you have scoring functions and are deriving failure modes from human annotation. Level 3: your agent makes tool calls to external systems and you need to account for state complexity in evals. Level 4: you're running evals at production scale and need automated failure mode discovery. Start wherever you are — the framework is designed for incremental progression.

Question 5

How do I extract failure modes from human annotation justifications?

Accepted Answer

Feed your collected thumbs-down justifications into a coding assistant like Cursor, Claude Code, or Codex. Ask it to systematically extract and categorize the failure modes embedded in the justifications. The output should be a structured list of distinct failure modes your agent actually exhibits — such as hallucination, incorrect escalation, or wrong tool selection. Each failure mode then becomes the target for a specific scoring function, either deterministic or LLM-as-judge.
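
Once the coding assistant has returned a category for each justification, a frequency ranking helps decide which failure modes to target first. The sketch below assumes the categorized output arrives as `(justification, failure_mode)` pairs; that shape and the sample rows are hypothetical.

```python
from collections import Counter

# Hypothetical assistant output: each justification tagged with one failure mode.
categorized = [
    ("Agent hallucinated a feature that doesn't exist.", "hallucination"),
    ("Escalated to a human for a simple password reset.", "incorrect_escalation"),
    ("Invented a discount code.", "hallucination"),
    ("Called the billing tool for a shipping question.", "wrong_tool_selection"),
    ("Claimed an integration we do not offer.", "hallucination"),
]


def rank_failure_modes(pairs: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Return failure modes ordered from most to least frequent."""
    return Counter(mode for _, mode in pairs).most_common()


ranked = rank_failure_modes(categorized)
for mode, count in ranked:
    print(f"{mode}: {count}")
```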

Question 6

How do I build a ground truth dataset for evaluating my LLM-as-judge?

Accepted Answer

Select a representative sample of your agent's outputs spanning both good and bad cases. Have a human domain expert manually label each output with the same scoring criteria your LLM-as-judge uses (pass/fail, a 1–5 scale, or whatever your scoring function outputs). Record the expert's rationale for each label. This labelled dataset becomes your ground truth. Run your LLM-as-judge against it and measure agreement. Because LLM-as-judge outputs are discrete, you can compute accuracy directly against the human labels; for pass/fail labels you can also compute precision and recall, and for a multi-point scale you can first binarize the scores at a pass threshold.
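
The agreement measurement can be sketched in a few lines of plain Python. This assumes binary labels where `True` means "the output fails" (the positive class); the function name and the sample labels are illustrative only.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare judge labels to human labels; True means 'output fails'."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)                # both flag a failure
    fp = sum((not h) and j for h, j in pairs)          # judge flags, human does not
    fn = sum(h and (not j) for h, j in pairs)          # human flags, judge misses
    tn = sum((not h) and (not j) for h, j in pairs)    # both pass the output
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


human = [True, True, False, False, True, False]
judge = [True, False, False, True, True, False]
metrics = judge_agreement(human, judge)
print(metrics)
```

Low recall here would mean the judge misses failures the expert caught; low precision would mean it flags outputs the expert accepted.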

Question 7

How do I set up the eval flywheel in practice?

Accepted Answer

Instrument your agent to capture full traces in production — every input, output, tool call, and intermediate step. Set up a review pipeline where human reviewers or automated tooling flag failures. Create a process to pull flagged traces into your offline eval dataset. Schedule regular eval runs against this growing dataset. After each eval run, use the results to prioritize your next agent improvement. Measure the impact of each change by rerunning evals. Automate this loop with CLI tooling as you mature.
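
Two small pieces of this loop are easy to sketch: pulling flagged traces into the offline dataset without duplicates, and comparing pass rates between eval runs. The `trace_id` key, the in-memory dataset, and the sample results are assumptions for illustration; a real setup would likely sit on top of your tracing platform.

```python
def add_flagged_traces(dataset: dict[str, dict], flagged: list[dict]) -> int:
    """Add flagged production traces to the offline dataset, skipping duplicates."""
    added = 0
    for trace in flagged:
        if trace["trace_id"] not in dataset:
            dataset[trace["trace_id"]] = trace
            added += 1
    return added


def pass_rate(results: list[bool]) -> float:
    """Fraction of eval cases that passed (0.0 for an empty run)."""
    return sum(results) / len(results) if results else 0.0


dataset = {"t-1": {"trace_id": "t-1", "input": "reset my password"}}
flagged = [
    {"trace_id": "t-1", "input": "reset my password"},
    {"trace_id": "t-2", "input": "cancel my booking"},
]
added = add_flagged_traces(dataset, flagged)

before = pass_rate([True, False, False, True])   # eval run before a change
after = pass_rate([True, True, False, True])     # eval run after the change
delta = after - before
print(added, delta)
```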

Question 8

What should I do if my LLM-as-judge scores don't align with human expert judgment?

Accepted Answer

Iterate on your judge prompt. Review the specific cases where the judge disagrees with humans and look for patterns — is the judge too lenient, too strict, or misunderstanding certain failure modes? Refine the scoring criteria in your prompt using the exact language from human justifications. Consider splitting a single judge into multiple specialized judges for different failure modes. If alignment remains poor for a specific failure mode, keep that one as a human-review checkpoint rather than automating it prematurely.
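
To look for patterns in the disagreements, it can help to count them by failure mode and direction. The sketch below assumes each row records the human verdict, the judge verdict, and the failure mode being judged; the field names are hypothetical.

```python
from collections import Counter


def disagreement_patterns(rows: list[dict]) -> Counter:
    """Count judge/human disagreements by (failure_mode, direction)."""
    patterns: Counter = Counter()
    for row in rows:
        if row["human_pass"] == row["judge_pass"]:
            continue  # agreement, nothing to investigate
        # Judge passed what the human failed: lenient. The reverse: strict.
        direction = "too_lenient" if row["judge_pass"] else "too_strict"
        patterns[(row["failure_mode"], direction)] += 1
    return patterns


rows = [
    {"failure_mode": "hallucination", "human_pass": False, "judge_pass": True},
    {"failure_mode": "hallucination", "human_pass": False, "judge_pass": True},
    {"failure_mode": "escalation", "human_pass": True, "judge_pass": False},
    {"failure_mode": "escalation", "human_pass": True, "judge_pass": True},
]
patterns = disagreement_patterns(rows)
print(patterns.most_common())
```

A cluster such as `("hallucination", "too_lenient")` points at the part of the judge prompt to tighten, or at a failure mode that may deserve its own specialized judge.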

Question 9

My eval dataset is too small — what should I do?

Accepted Answer

If you don't have production traces yet, start with UAT-level traces from internal testing. Run your agent through realistic scenarios that mirror expected production usage and capture those traces. You can supplement with hand-crafted examples for known edge cases, but prioritize getting real or near-real traces as quickly as possible. As soon as your agent reaches production or even a beta environment, instrument trace capture and begin replacing synthetic examples with real ones. A small dataset of real traces is more valuable than a large dataset of hypothetical ones.

Question 10

How do I handle eval runs that accidentally write to production systems?

Accepted Answer

This is a Level 3 CRUD-based tool call problem. Set up mock APIs that accept tool calls without writing real data. Embed relevant external system state — like available calendar slots or database records — directly into your captured trace payload so the eval reads from the trace rather than querying live systems. If your external systems support timestamp or version queries, use them to replay the exact state that existed when the original trace was captured. Never run eval pipelines with production write access.
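
A mock tool that reads its state from the trace might look like the sketch below. The `MockCalendarTool` class, the `system_state.available_slots` payload key, and the return shapes are all assumptions chosen for illustration.

```python
class MockCalendarTool:
    """Stands in for a booking API during evals; never touches a live system."""

    def __init__(self, trace: dict) -> None:
        # State snapshot that was embedded in the trace at capture time.
        self._slots = set(trace["system_state"]["available_slots"])
        self.recorded_writes: list[dict] = []

    def list_slots(self) -> list[str]:
        return sorted(self._slots)

    def book(self, slot: str) -> dict:
        # Record the intended write instead of performing it.
        self.recorded_writes.append({"action": "book", "slot": slot})
        if slot not in self._slots:
            return {"ok": False, "error": "slot unavailable"}
        self._slots.discard(slot)
        return {"ok": True, "slot": slot}


trace = {"system_state": {"available_slots": ["09:00", "10:30"]}}
tool = MockCalendarTool(trace)
first = tool.book("09:00")
second = tool.book("09:00")
print(first, second, tool.list_slots())
```

Keeping `recorded_writes` lets a scoring function assert on what the agent tried to write, without any of it reaching production.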

Question 11

How does the Hetzel Eval Maturity Framework compare to generic prompt testing?

Accepted Answer

Generic prompt testing typically means running a few examples through your prompt and eyeballing the outputs — essentially permanent Level 1 with no progression path. The Hetzel framework provides a structured maturity continuum that takes you from that starting point through human annotation with justifications, derived failure modes, validated LLM-as-judge scoring, production trace integration, and automated flywheel improvement. It also addresses complexities that generic testing ignores entirely, like CRUD tool calls, full trace evaluation, and topic modelling for failure mode discovery.

Question 12

How does this framework compare to traditional software testing approaches for AI?

Accepted Answer

Traditional software testing aims for exhaustive coverage — every branch, every edge case. The Hetzel framework explicitly rejects this for LLM evaluation because the failure space is infinite. Instead, it targets known, high-priority failure modes identified by subject matter experts. It also embraces non-determinism: eval results don't need 100% accuracy, just directional trending. Traditional testing uses synthetic fixtures; this framework prioritizes real production traces. The eval flywheel concept has no direct analog in traditional testing — it's a continuous feedback loop unique to AI system evaluation.

Question 13

Can I use the Hetzel Eval Maturity Framework for RAG pipelines?

Accepted Answer

Yes. RAG pipelines are a natural fit because they have clearly identifiable failure modes — retrieval relevance, answer faithfulness to retrieved context, hallucination beyond retrieved documents, and incomplete information synthesis. Apply the framework by having an SME annotate RAG outputs with justifications, derive failure modes like 'answer contradicts retrieved passage' or 'relevant document not retrieved,' build LLM-as-judge functions targeting those modes, and validate the judge against human ground truth. The flywheel works well because production RAG queries provide a natural stream of real traces.

Question 14

How do I evaluate multi-step AI agents with the Hetzel framework?

Accepted Answer

For multi-step agents, scoring only the final response is not enough: you must instrument and capture the entire trace, including every tool call, intermediate decision, and step. Target scoring functions at individual steps, not just the end result. Individual tool calls or MCP calls can each be a failure vector. You need platform tooling capable of ingesting and querying arbitrarily large traces. This is a Level 3 concern, and it requires distinguishing context-gathering steps from CRUD-based ones so that each is handled safely during evals.
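
Step-level scoring can be sketched as a registry of scorers keyed by step type. The step schema (`type`, `name`, `text`), the allowed tool names, and the scorer rules below are hypothetical placeholders.

```python
from typing import Callable

Step = dict
StepScorer = Callable[[Step], bool]


def score_trace(steps: list[Step], scorers: dict[str, StepScorer]) -> list[dict]:
    """Apply the scorer registered for each step type; return the failing steps."""
    failures = []
    for index, step in enumerate(steps):
        scorer = scorers.get(step["type"])
        if scorer is not None and not scorer(step):
            failures.append({"step": index, "type": step["type"]})
    return failures


scorers = {
    # Hypothetical rule: only these tools are expected for this task.
    "tool_call": lambda s: s["name"] in {"search_kb", "create_ticket"},
    "final_response": lambda s: len(s["text"]) > 0,
}
steps = [
    {"type": "tool_call", "name": "search_kb"},
    {"type": "tool_call", "name": "delete_account"},
    {"type": "final_response", "text": "Your ticket has been created."},
]
failures = score_trace(steps, scorers)
print(failures)
```

In this example the final response looks fine, yet the trace still fails at step 1, which is the kind of problem that final-response scoring alone would miss.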

Question 15

What is topic modelling in the context of LLM evals?

Accepted Answer

Topic modelling in the Hetzel framework (Level 4) means running automated analysis across production traces at scale to surface emerging failure modes you didn't anticipate. Rather than relying solely on manual failure mode identification by SMEs, you use statistical or ML-based topic modelling to cluster production traces and discover new patterns of failure. This is combined with automated eval pipeline execution via CLI tooling to run evals continuously. It's how mature teams stay ahead of novel failure modes as their agent encounters new types of real-world inputs.
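
To make the clustering idea concrete, here is a deliberately simple stand-in: greedy grouping of failure descriptions by Jaccard word overlap. This is not a real topic model (a production setup would more likely use something like LDA or embedding-based clustering), and the tokenizer, threshold, and sample texts are assumptions.

```python
def tokens(text: str) -> set[str]:
    """Crude tokenizer: lowercase words longer than three characters."""
    return {w for w in text.lower().split() if len(w) > 3}


def greedy_clusters(texts: list[str], threshold: float = 0.3) -> list[list[int]]:
    """Assign each text to the first cluster whose vocabulary overlaps enough."""
    clusters: list[tuple[set[str], list[int]]] = []
    for i, text in enumerate(texts):
        toks = tokens(text)
        for vocab, members in clusters:
            union = vocab | toks
            if union and len(vocab & toks) / len(union) >= threshold:
                members.append(i)
                vocab |= toks  # in-place: grows the cluster vocabulary
                break
        else:
            clusters.append((toks, [i]))
    return [members for _, members in clusters]


texts = [
    "agent booked the wrong calendar slot",
    "agent booked a calendar slot already taken",
    "refund amount quoted incorrectly",
    "quoted refund amount exceeded policy",
]
groups = greedy_clusters(texts)
print(groups)
```

Each resulting group is a candidate failure mode for an SME to review and name, so the automation surfaces patterns while the judgment stays with a human.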

Question 16

What is the minimum viable eval setup for getting an LLM agent to production?

Accepted Answer

At minimum, complete Level 1 and Level 2. Have a subject matter expert review 10–20 representative outputs with documented justifications. Derive your top 3–5 failure modes from those justifications. Build at least one deterministic and one LLM-as-judge scoring function targeting those failure modes. Validate the LLM-as-judge against a small human ground truth set. Run your eval against a dataset that includes real or UAT-level traces. This gives you measurable, defensible quality metrics — enough to justify a production launch with monitoring in place.

Question 17

Why shouldn't I wait until my eval system is perfect before launching?

Accepted Answer

Waiting for perfection is explicitly called out as a pitfall in the Hetzel framework. Vibe checking with documented human annotation is a legitimate starting point and is far better than no evals. Eval results don't need to be perfect — they need to be directional. If scores trend in the right direction as you iterate, that's sufficient. The flywheel depends on production traces, which you can't get without launching. Starting imperfect and iterating is the entire philosophy — the framework is designed for incremental maturation, not big-bang perfection.

Question 18

How many examples do I need in my eval dataset?

Accepted Answer

Start with at least 10–20 representative examples at Level 1 for vibe checking. As you progress to Level 2, aim for enough examples to cover each identified failure mode with multiple instances. The Hetzel framework emphasizes quality over quantity — 30 real production traces are more valuable than 300 synthetic examples. As you activate the flywheel, your dataset grows continuously from production. There's no magic number; the goal is that your dataset represents the actual distribution of inputs your agent encounters in production.
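
One way to check whether each failure mode has multiple instances is a simple coverage count. The `failure_mode` field and the minimum of three examples per mode are illustrative assumptions, not numbers from the framework.

```python
from collections import Counter


def undercovered_modes(
    examples: list[dict], failure_modes: list[str], min_per_mode: int = 3
) -> dict[str, int]:
    """Return failure modes with fewer than min_per_mode examples, with counts."""
    counts = Counter(e["failure_mode"] for e in examples if e.get("failure_mode"))
    return {m: counts[m] for m in failure_modes if counts[m] < min_per_mode}


examples = [
    {"id": 1, "failure_mode": "hallucination"},
    {"id": 2, "failure_mode": "hallucination"},
    {"id": 3, "failure_mode": "hallucination"},
    {"id": 4, "failure_mode": "wrong_tool_selection"},
    {"id": 5, "failure_mode": None},  # a passing example
]
gaps = undercovered_modes(
    examples, ["hallucination", "wrong_tool_selection", "incorrect_escalation"]
)
print(gaps)
```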

Question 19

Can I automate the entire eval process without any human involvement?

Accepted Answer

Not entirely, and attempting to do so is a pitfall. Humans are essential at multiple points: initial vibe checking and justification writing, building the ground truth dataset for validating LLM-as-judge, and periodic review of production traces to catch failure modes that automated systems miss. At Level 4 you can automate pipeline execution and failure mode discovery via topic modelling, but human expertise remains the foundation that the automation is built on. The framework automates scale, not judgment.

Question 20

What tools do I need to implement the Hetzel Eval Maturity Framework?

Accepted Answer

At Level 1, you only need a spreadsheet or simple annotation tool for recording verdicts and justifications. At Level 2, you need a coding assistant (Cursor, Claude Code, Codex) for failure mode extraction and a way to write and run scoring functions. At Level 3, you need trace capture and ingestion tooling, mock API infrastructure, and a platform that can query large traces. At Level 4, you need CLI-based eval pipeline automation and topic modelling capabilities. Braintrust is one platform purpose-built for this workflow, but the framework is tool-agnostic.

Question 21

What's the difference between context-gathering and CRUD-based tool calls in evals?

Accepted Answer

Context-gathering tools are read-only: they pull data into the LLM context without modifying external systems (e.g., searching a knowledge base, reading a user profile). These are lower risk in eval environments. CRUD-based tools can create, update, or delete data in external systems as well as read it (e.g., booking appointments, sending emails, updating records). The write operations make them high risk, because eval runs can corrupt production data. The Hetzel framework requires you to mock CRUD tools or embed system state in traces to run evals safely.
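
The distinction can be encoded as a small registry that an eval harness consults before executing a tool call. The tool names and the `must_mock_in_eval` helper are hypothetical; the sketch assumes unknown tools should be treated as CRUD so that a missing entry fails safe.

```python
from enum import Enum


class ToolKind(Enum):
    CONTEXT_GATHERING = "context_gathering"
    CRUD = "crud"


# Hypothetical registry mapping each tool to its risk category.
TOOL_KINDS = {
    "search_knowledge_base": ToolKind.CONTEXT_GATHERING,
    "read_user_profile": ToolKind.CONTEXT_GATHERING,
    "book_appointment": ToolKind.CRUD,
    "send_email": ToolKind.CRUD,
}


def must_mock_in_eval(tool_name: str) -> bool:
    """Unknown tools default to CRUD so a missing entry fails safe."""
    return TOOL_KINDS.get(tool_name, ToolKind.CRUD) is ToolKind.CRUD


for name in ("search_knowledge_base", "send_email", "unregistered_tool"):
    print(name, must_mock_in_eval(name))
```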

Frequently Asked Questions About Hetzel Eval Maturity Phases Framework
