How Do AI Startup Teams Build Eval Systems That Scale?
For AI engineering team leads at startups · Based on Hetzel Eval Maturity Phases Framework
// TL;DR
The Hetzel Eval Maturity Phases Framework helps AI startup team leads move from ad-hoc agent testing to a structured, scalable eval system. Start with documented human annotation at Level 1, derive failure modes and build scoring functions at Level 2, handle tool calls and trace complexity at Level 3, and automate failure discovery at Level 4. This is especially valuable when your team is stuck in proof-of-concept because nobody can prove the agent is production-ready, or when you need to communicate quality metrics to investors and stakeholders.
Why do AI startup teams get stuck between proof-of-concept and production?
The most common blocker is the inability to answer the question: "How do we know this agent is good enough to ship?" Without structured evals, your team relies on gut feel — the builder runs a few examples, the output looks fine, and you deploy. Then production users find the failure modes you missed.
The Hetzel Eval Maturity Phases Framework gives your team a concrete path from informal review to measurable, defensible quality. It starts exactly where most startups are — vibe checking — and provides a clear progression through four maturity levels.
How should a startup team lead begin implementing this framework?
Start at Level 1 regardless of how sophisticated you think your needs are. Select 10–20 representative inputs that reflect real user behavior. Have a subject matter expert — your product manager, a domain consultant, or an experienced user, but critically not the engineer who built the agent — review each output.
The key discipline: every review must include a written justification, not just thumbs up or thumbs down. "Thumbs down — agent recommended a pricing tier that doesn't exist" is actionable. "Thumbs down" alone is not. These justifications are the raw material for everything that follows.
As team lead, your job is to enforce this discipline even when the team pushes back that it's slow. The justifications are what let you scale human expertise into automated scoring at Level 2.
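One way to make the justification rule hard to skip is to encode it in the annotation record itself. The sketch below is a minimal illustration, not part of the framework: the `Annotation` class, its field names, and the `thumbs_down_justifications` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One expert review of one agent output (field names are illustrative)."""
    input_text: str
    agent_output: str
    reviewer: str
    builder: str        # the engineer who built the agent
    verdict: str        # "up" or "down"
    justification: str  # required free-text reason for the verdict

    def __post_init__(self):
        if self.verdict not in ("up", "down"):
            raise ValueError("verdict must be 'up' or 'down'")
        # Enforce the discipline: no bare thumbs up / thumbs down.
        if not self.justification.strip():
            raise ValueError("a written justification is required")
        # The reviewer should not be the person who built the agent.
        if self.reviewer == self.builder:
            raise ValueError("reviewer must not be the agent's builder")


def thumbs_down_justifications(annotations):
    """Collect the justifications that feed failure-mode extraction."""
    return [a.justification for a in annotations if a.verdict == "down"]
```

Rejecting an empty justification at creation time means the Level 2 work never has to deal with unexplained verdicts.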
How do you move from vibe checking to measurable quality metrics?
Once you have around 20 annotated examples with justifications, feed the thumbs-down justifications into a coding assistant (Cursor, Claude, Codex) and ask it to extract and categorize failure modes. You'll get a structured list of categories such as hallucination, wrong tool selection, format errors, and safety violations.
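If you want the extraction step to be repeatable, the prompt and the tallying can live in a small script. This is a sketch under assumptions: the prompt wording, the JSON reply shape, and both function names are made up for illustration, and the call to the assistant itself is left out.

```python
import json
from collections import Counter


def build_extraction_prompt(justifications):
    """Number the thumbs-down justifications and ask for a category per item."""
    numbered = "\n".join(
        f"{i}. {text}" for i, text in enumerate(justifications, start=1)
    )
    return (
        "Below are reviewer justifications for rejected agent outputs.\n"
        "Group them into failure modes. Reply with a JSON object mapping each "
        "justification number to a short failure-mode name.\n\n" + numbered
    )


def tally_failure_modes(reply_json):
    """Count how many justifications fell into each failure mode.

    Assumes the assistant replied with the JSON object requested above.
    """
    assignments = json.loads(reply_json)
    return Counter(assignments.values())
```

The tally gives a rough ranking of which failure modes to build scorers for first.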
For each failure mode, build a scoring function. Objective failures (too many API calls, wrong output format) get deterministic code-based scorers. Subjective failures (tone, accuracy, completeness) get LLM-as-judge scorers written using the justification language from your annotations.
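The two scorer types might look like the sketch below. Everything here is an assumption for illustration: the trace shape (a dict with a `tool_calls` list), the call budget of 5, the required `answer` key, and the `call_llm` callable, which stands in for whatever model client you use.

```python
import json
from typing import Callable, Sequence

MAX_API_CALLS = 5  # assumed budget; pick a value that fits your agent


def score_api_calls(trace: dict) -> float:
    """Deterministic scorer: 1.0 if the agent stayed within the call budget."""
    return 1.0 if len(trace.get("tool_calls", [])) <= MAX_API_CALLS else 0.0


def score_json_format(output: str, required_keys: Sequence[str] = ("answer",)) -> float:
    """Deterministic scorer: 1.0 if the output is a JSON object with the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    ok = isinstance(parsed, dict) and all(k in parsed for k in required_keys)
    return 1.0 if ok else 0.0


def build_judge_prompt(failure_mode: str, example_justifications: Sequence[str],
                       agent_output: str) -> str:
    """Write the judge prompt in the reviewers' own justification language."""
    examples = "\n".join(f"- {j}" for j in example_justifications)
    return (
        f"You are checking an agent output for this failure mode: {failure_mode}.\n"
        f"Human reviewers described past failures like this:\n{examples}\n\n"
        f"Agent output:\n{agent_output}\n\n"
        "Answer with exactly one word: PASS or FAIL."
    )


def score_with_judge(call_llm: Callable[[str], str], failure_mode: str,
                     example_justifications: Sequence[str], agent_output: str) -> float:
    """LLM-as-judge scorer: 1.0 on PASS, 0.0 on anything else."""
    prompt = build_judge_prompt(failure_mode, example_justifications, agent_output)
    reply = call_llm(prompt).strip().upper()
    return 1.0 if reply.startswith("PASS") else 0.0
```

Passing the model call in as a function keeps the scorer testable with a stub and independent of any particular provider.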
Critically: eval the eval. Build a small ground truth dataset of human-labelled outputs and measure your LLM-as-judge alignment. This is how you earn the right to trust automated scoring.
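Measuring alignment can be as simple as comparing judge verdicts to human labels on the same outputs. The sketch below computes raw agreement and Cohen's kappa. Kappa is a common chance-corrected agreement statistic, included here as a suggestion rather than something the framework prescribes, and the function names are hypothetical.

```python
def agreement_rate(human, judge):
    """Fraction of cases where the judge verdict matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    labels = set(human) | set(judge)
    # Chance agreement: both raters pick the same label independently.
    expected = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Tiny made-up example: True means "pass", False means "fail".
human_labels = [True, True, False, False]
judge_labels = [True, False, False, False]
print(agreement_rate(human_labels, judge_labels))  # 0.75
print(cohens_kappa(human_labels, judge_labels))    # 0.5
```

Raw agreement can look high when almost every case passes, which is why a chance-corrected number is worth reporting alongside it.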
How does the flywheel help a startup iterate faster?
The flywheel is where evals shift from a gate ("can we ship?") to an engine ("what should we improve next?"). Capture production traces, identify failures through human review or automated alerts, pull those failures into your offline eval dataset, rerun evals, and use results to guide the next improvement.
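The loop described above can be sketched in a few lines. This is an illustration under assumptions: traces and eval cases are plain dicts, a `flagged` key marks a failure found by review or alerting, and `agent` and `scorer` are callables you supply. None of these names come from the framework.

```python
def add_failures_to_dataset(dataset, traces):
    """Pull flagged production traces into the offline eval dataset.

    Skips inputs that are already covered, and returns how many cases were added.
    """
    seen = {case["input"] for case in dataset}
    added = 0
    for trace in traces:
        if trace.get("flagged") and trace["input"] not in seen:
            dataset.append({
                "input": trace["input"],
                "source": "production",
                "note": trace.get("note", ""),
            })
            seen.add(trace["input"])
            added += 1
    return added


def run_evals(dataset, agent, scorer):
    """Rerun the agent on every case and return the mean score."""
    scores = [scorer(case, agent(case["input"])) for case in dataset]
    return sum(scores) / len(scores) if scores else 0.0
```

Running `run_evals` before and after a change on the same dataset gives you the before/after number for that change.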
For startup teams, this is transformative. Every customer complaint becomes a concrete eval case. Every agent improvement gets measured. You can tell your stakeholders: "This change improved our accuracy score from 72% to 84% on 150 production-derived eval cases." That's the language that unlocks production deployments and investor confidence.
What should a startup team lead do next?
Audit your current eval maturity against the framework's four levels. If you're at Level 0 (no evals at all), start Level 1 this week; it takes one afternoon. If you're at Level 1, prioritize extracting failure modes from existing annotations. Assign one engineer to own the eval system as infrastructure, not a side project. The eval system is as important as the agent itself, because it is what makes confident iteration possible.
// FREQUENTLY ASKED QUESTIONS
How many engineers should work on the eval system at a startup?
At minimum, one engineer should own the eval system as dedicated infrastructure. At Levels 1 and 2, this is part-time work: setting up annotation workflows and building scoring functions. At Level 3 and beyond, it may require more investment as you handle trace instrumentation, mock APIs, and automated pipelines. The key is treating evals as a first-class engineering concern, not a side project that gets deprioritized when shipping pressure increases.
Can a startup skip levels in the Hetzel framework to move faster?
No. Each level builds on the outputs of the previous one. Level 2 scoring functions require Level 1 human annotation justifications. Level 3 trace-level evaluation requires Level 2 scoring functions. Skipping human annotation and jumping straight to LLM-as-judge means your judge lacks grounded criteria and you have no ground truth to validate it against. The framework is designed so each level can be implemented quickly: Level 1 takes an afternoon, and Level 2 takes a few days.
How do I communicate eval results to non-technical stakeholders?
Focus on failure mode categories and trend lines. Instead of technical scoring details, present: "Our agent has five identified failure modes. Here's how each score has trended over the last four iterations. Hallucination rate dropped from 18% to 4%." The Hetzel framework's emphasis on targeting known failure modes rather than abstract test coverage makes results inherently more communicable, because each failure mode maps to a concrete business risk that stakeholders understand.
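A stakeholder summary like this can be generated straight from per-iteration scores. The sketch below assumes you keep a history of failure rates per failure mode as a dict of lists. The `trend_summary` function is hypothetical; the 18% and 4% endpoints reuse the illustrative figures above, and the intermediate values are made up.

```python
def trend_summary(history):
    """Turn per-iteration failure rates into one plain sentence per failure mode.

    `history` maps a failure-mode name to its rate at each iteration, oldest first.
    """
    lines = []
    for mode, rates in history.items():
        first, last = rates[0], rates[-1]
        if last < first:
            lines.append(
                f"{mode}: dropped from {first:.0%} to {last:.0%} "
                f"over {len(rates)} iterations"
            )
        elif last > first:
            lines.append(
                f"{mode}: rose from {first:.0%} to {last:.0%} "
                f"over {len(rates)} iterations"
            )
        else:
            lines.append(
                f"{mode}: held steady at {last:.0%} over {len(rates)} iterations"
            )
    return lines


for line in trend_summary({"Hallucination": [0.18, 0.11, 0.07, 0.04]}):
    print(line)  # Hallucination: dropped from 18% to 4% over 4 iterations
```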