How Do AI Team Leads Build Eval Processes That Scale?
For ML/AI team leads and engineering managers · Based on the Hetzel Eval Maturity Phases Framework
// TL;DR
The Hetzel Eval Maturity Phases Framework gives AI team leads a structured process for building evaluation systems that scale from early prototyping through production. It answers the management question: 'How do we know this agent is good enough to ship?' Use it to establish quality baselines, create repeatable eval processes across your team, manage reputational and compliance risk, and build a continuous improvement flywheel that turns production data into measurable agent improvements. The framework explicitly addresses organizational challenges like extracting domain knowledge from SMEs and validating automated scoring.
How do you move your team from ad-hoc agent review to measurable quality?
Most AI teams start with informal review — engineers check their own outputs and declare things 'good enough.' The Hetzel Eval Maturity Phases Framework gives you a clear organizational progression from this starting point.
The key insight for team leads: the framework's Level 1 requires human annotation by a subject matter expert (SME), not the engineer who built the agent. This separation is critical. Builders have blind spots; SMEs have domain knowledge about what quality actually means to users. Your first organizational action is to establish this separation and require a written justification alongside every verdict.
What organizational structures support each maturity level?
Level 1 — Structured Vibe Checking: Assign an SME reviewer for each agent. Create a simple annotation workflow: 10–20 example inputs, thumbs up/down plus written justification per output. This can be done in a spreadsheet. The justifications are your team's most valuable eval artifact — treat them as institutional knowledge.
Level 2 — Measuring to Manage: Establish a process for deriving failure modes from annotation justifications. This can be done by feeding justifications into a coding assistant. Assign ownership of scoring functions — who builds them, who maintains them, who validates them against human ground truth. The 'eval the eval' practice needs to be a team norm, not an afterthought.
Level 3 — Accounting for Complexity: For agents with external dependencies, establish infrastructure standards: mock APIs for CRUD tools, trace capture requirements, state embedding protocols. This requires coordination between your AI team and infrastructure/platform teams.
Level 4 — The Flywheel: Operationalize the continuous improvement loop. Define who reviews production traces, how failures get pulled into eval datasets, and how eval results feed into sprint planning. The flywheel is an organizational process as much as a technical one.
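As a rough illustration of the Level 2 'eval the eval' norm, the sketch below compares an automated judge's verdicts with SME verdicts and lists the examples where they disagree. The `Annotation` record, the `judge_agreement` helper, and the sample data are all hypothetical; the framework does not prescribe a specific data model or agreement metric, and simple percent agreement is only one option.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One SME verdict on one agent output (a Level 1 artifact). Hypothetical shape."""
    example_id: str
    passed: bool          # thumbs up / thumbs down
    justification: str    # the SME's written reasoning


def judge_agreement(annotations, judge_verdicts):
    """Compare automated verdicts with SME verdicts.

    Returns (agreement_rate, disagreements), where disagreements lists the
    example ids on which the judge differs from the human ground truth.
    """
    scored = [a for a in annotations if a.example_id in judge_verdicts]
    if not scored:
        raise ValueError("no overlap between annotations and judge verdicts")
    disagreements = [
        a.example_id for a in scored if judge_verdicts[a.example_id] != a.passed
    ]
    rate = 1 - len(disagreements) / len(scored)
    return rate, disagreements


# Illustrative data only.
annotations = [
    Annotation("ex-1", True, "Answer cites the correct policy section."),
    Annotation("ex-2", False, "Quoted a refund window that does not exist."),
    Annotation("ex-3", False, "Tone is fine but the account number is wrong."),
    Annotation("ex-4", True, "Complete and accurate."),
]
judge_verdicts = {"ex-1": True, "ex-2": False, "ex-3": True, "ex-4": True}

rate, disagreements = judge_agreement(annotations, judge_verdicts)
print(f"agreement: {rate:.0%}, review: {disagreements}")
```

The disagreement list is the useful part for a team lead: each id points back to an SME justification that the scoring function's owner can use to revise the judge.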
How do you manage reputational and compliance risk with evals?
The Hetzel framework positions agent quality as the north star, which maps directly to risk management. Reputational risk: eval scores give you a defensible basis for production decisions. Compliance risk: documented eval runs validated against human ground truth create an audit trail. System cost risk: deterministic scoring functions can flag excessive token usage or tool calls.
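A deterministic cost check of the kind mentioned above might look like the following minimal sketch. The `Trace` fields, the `cost_flags` function, and the budget values (8000 tokens, 10 tool calls) are assumptions chosen for illustration; real budgets depend on your agent and pricing.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical summary of one agent run."""
    trace_id: str
    total_tokens: int
    tool_calls: int


def cost_flags(trace, max_tokens=8000, max_tool_calls=10):
    """Return human-readable flags; an empty list means the run is within budget."""
    flags = []
    if trace.total_tokens > max_tokens:
        flags.append(f"tokens {trace.total_tokens} > {max_tokens}")
    if trace.tool_calls > max_tool_calls:
        flags.append(f"tool calls {trace.tool_calls} > {max_tool_calls}")
    return flags


# Illustrative data only.
traces = [
    Trace("t-1", total_tokens=3200, tool_calls=4),
    Trace("t-2", total_tokens=12500, tool_calls=3),
    Trace("t-3", total_tokens=9100, tool_calls=14),
]
report = {t.trace_id: cost_flags(t) for t in traces}
flagged = {tid: flags for tid, flags in report.items() if flags}
for tid, flags in flagged.items():
    print(tid, flags)
```

Because the check is deterministic, the same trace always produces the same flags, which makes the results straightforward to include in an audit trail.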
For team leads, the key deliverable is not just quality scores but a documented, repeatable process that stakeholders can audit. Each maturity level produces progressively stronger evidence of quality: Level 1 gives you annotated reviews, Level 2 gives you quantified scores, Level 3 gives you full trace evaluation, and Level 4 gives you continuous automated validation.
What's the biggest process mistake AI team leads make with evals?
Trying to jump to full automation before extracting domain knowledge from humans. The framework's principles are explicit: you must extract domain knowledge through justifications before you can scale it through LLM-as-judge. Teams that skip human annotation and go straight to automated scoring build judges that don't reflect what quality actually means in their domain.
The second biggest mistake: treating evals as a one-time gate rather than a continuous process. The flywheel concept — capture, identify, pull, rerun, improve — needs to be embedded in your team's operating rhythm, not treated as a pre-launch checkbox.
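The identify, pull, and rerun steps of the flywheel can be sketched in a few lines. Everything here is hypothetical: the dict-based trace and dataset shapes, the `pull_failures` and `rerun` helpers, and the toy agent and scorer stand in for whatever trace store, agent, and scoring functions your team actually uses.

```python
def pull_failures(production_traces, eval_dataset, is_failure):
    """Identify failing traces and pull new ones into the eval dataset.

    Returns the inputs that were added. Inputs already in the dataset are
    skipped so the same failure is not added twice.
    """
    known = {case["input"] for case in eval_dataset}
    added = []
    for trace in production_traces:
        if is_failure(trace) and trace["input"] not in known:
            eval_dataset.append({"input": trace["input"], "source": "production"})
            known.add(trace["input"])
            added.append(trace["input"])
    return added


def rerun(eval_dataset, agent, scorer):
    """Rerun the agent over the dataset and return the pass rate."""
    if not eval_dataset:
        raise ValueError("eval dataset is empty")
    results = [scorer(case["input"], agent(case["input"])) for case in eval_dataset]
    return sum(results) / len(results)


# Illustrative data only.
eval_dataset = [{"input": "reset my password", "source": "seed"}]
production_traces = [
    {"input": "reset my password", "output": "", "user_flagged": True},
    {"input": "cancel my order", "output": "", "user_flagged": True},
    {"input": "track my parcel", "output": "It arrives Friday.", "user_flagged": False},
]
added = pull_failures(production_traces, eval_dataset, lambda t: t["user_flagged"])


def toy_agent(text):
    """Stand-in agent that fails on cancellation requests."""
    return "" if "cancel" in text else f"Handled: {text}"


def toy_scorer(_input, output):
    """Stand-in scorer: any non-empty output counts as a pass."""
    return bool(output)


pass_rate = rerun(eval_dataset, toy_agent, toy_scorer)
print(f"added {added}, pass rate {pass_rate:.0%}")
```

The pass rate after each rerun is the number that can feed sprint planning; the 'improve' step is the engineering work that moves it.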
What's the next step for your team?
Audit your current state honestly across the four maturity levels. For each agent your team owns, ask: Do we have documented human annotations with justifications? Do we have derived failure modes and scoring functions? Are we validating our judges? Are we capturing production traces? Then build a maturity roadmap that advances each agent one level at a time, with clear ownership and timelines.
// FREQUENTLY ASKED QUESTIONS
How do I make the case for investing engineering time in evals?
Frame evals as risk management and iteration speed, not as overhead. Without evals, every agent change is a gamble — you can't measure whether it improved or regressed quality. With evals, every change is validated against real-world conditions. The Hetzel framework's flywheel specifically turns production failures into measured improvements, making each sprint's agent work more targeted and less wasteful. Evals also create the audit trail needed for compliance-sensitive deployments.
Who should be the human annotator — the engineer or someone else?
A subject matter expert, not the engineer who built the agent. The Hetzel framework is explicit about this: builders have blind spots about their own work. The SME holds domain-specific knowledge about what quality means to real users. If you don't have a dedicated SME, use someone from customer support, product, or the team that currently handles the task the agent is automating. Their justifications capture knowledge that no amount of engineering intuition can replace.
How do I know when my team is ready to move from one maturity level to the next?
Move to Level 2 when you have 10–20 annotated outputs with written justifications and can identify 3–5 distinct failure modes from them. Move to Level 3 when your agent has external system dependencies that make eval runs risky or inaccurate. Move to Level 4 when your eval dataset is growing from production traces and you need automated failure mode discovery to keep pace. Each transition is driven by need and capability, not a calendar.
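The Level 2 criteria above can be turned into a simple readiness check. This is a sketch under assumptions: the `level_2_gaps` function and the dict-based annotation shape are hypothetical, and the defaults use the lower bounds of the ranges given above (10 justified annotations, 3 distinct failure modes).

```python
def level_2_gaps(annotations, failure_modes, min_annotations=10, min_failure_modes=3):
    """Return the gaps blocking a move to Level 2; an empty list means ready."""
    # Only annotations with a non-empty written justification count.
    justified = [a for a in annotations if a.get("justification", "").strip()]
    # Normalize labels so near-duplicates are not counted as distinct modes.
    distinct = {m.strip().lower() for m in failure_modes if m.strip()}
    gaps = []
    if len(justified) < min_annotations:
        gaps.append(f"{len(justified)} justified annotations, need {min_annotations}")
    if len(distinct) < min_failure_modes:
        gaps.append(f"{len(distinct)} distinct failure modes, need {min_failure_modes}")
    return gaps


# Illustrative data only: 10 justified annotations plus 2 without justification.
annotations = [{"verdict": "down", "justification": f"note {i}"} for i in range(10)]
annotations += [{"verdict": "up", "justification": ""}, {"verdict": "up"}]
failure_modes = ["Hallucinated policy", "hallucinated policy ", "Wrong account"]

gaps = level_2_gaps(annotations, failure_modes)
print("ready for Level 2" if not gaps else f"not yet: {gaps}")
```

In this example the team has enough justified annotations but only two distinct failure modes after normalization, so the check reports one gap.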