Question 1

What does 'the model is already built' mean for agentic AI teams?

Accepted Answer

It means that, unlike in traditional ML, the foundation model (trained by Anthropic, OpenAI, Mistral, etc.) already exists. Your team's job is not to build a model from scratch but to implement, evaluate, and contextualise one. This fundamentally changes which skills matter most: for most agent use cases, context engineering and domain expertise outweigh data pipeline and model training skills.

Question 2

What is the difference between a Traditional Enterprise and an AI Native in the Hetzel framework?

Accepted Answer

A Traditional Enterprise approaches agentic development by delegating it to an existing ML or data science platform team, typically because AI is categorised as an 'ML problem.' An AI Native built its entire offering around agents from the start, using small cross-functional teams. The key risk differs: Traditional Enterprises over-index on ML metrics and exclude non-technical experts, while AI Natives risk neglecting engineering rigour and skipping formal eval processes.

Question 3

What is an agent trace and who should review it?

Accepted Answer

An agent trace is a logged record of an agent's execution steps, decisions, and outputs. Both technical and non-technical team members should review traces. Domain experts review traces during human annotation to label whether the agent performed well or poorly and explain why. Engineers review traces for debugging and observability. The Hetzel framework treats trace review by domain experts as structurally essential, not optional.
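
As a rough illustration of what such a record might hold, here is a minimal sketch in Python. The class and field names (AgentTrace, TraceStep, kind, detail) are hypothetical and not part of the Hetzel framework; the point is only that a trace can be rendered in a form a non-technical reviewer can read as easily as an engineer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TraceStep:
    """One logged step in an agent run (field names are illustrative)."""
    kind: str    # e.g. "llm_call", "tool_call", "decision"
    detail: str  # what the agent did or decided at this step
    output: str  # what the step produced


@dataclass
class AgentTrace:
    """A logged record of one agent execution: steps, decisions, outputs."""
    trace_id: str
    user_input: str
    steps: List[TraceStep] = field(default_factory=list)
    final_output: str = ""

    def summary(self) -> str:
        """Plain-text rendering intended for domain-expert review."""
        lines = [f"Trace {self.trace_id}: {self.user_input}"]
        for i, step in enumerate(self.steps, start=1):
            lines.append(f"  {i}. [{step.kind}] {step.detail} -> {step.output}")
        lines.append(f"  Final: {self.final_output}")
        return "\n".join(lines)
```

Engineers would typically work with the structured fields, while domain experts would read the summary during annotation.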

Question 4

How do I classify my organisation type before applying the Hetzel framework?

Accepted Answer

Ask two questions: Does your organisation have an existing ML or data science team that has been handed the agent mandate top-down? If yes, classify as Traditional Enterprise. Was your organisation built around agents from the start with small, cross-functional teams? If yes, classify as AI Native. This classification determines your default risk profile and shapes which team composition gaps to address first.

Question 5

How do I audit my current agent team for coverage gaps?

Accepted Answer

Map every team member to one of three personas: (1) Data Scientist / ML Engineer, (2) Product / Application / Systems Engineer, (3) Non-Technical Domain Expert or SME. If any persona is entirely absent, that is a critical gap. A team staffed only by ML engineers is the most common warning sign. The Hetzel framework requires all three personas contributing at defined stages of agent development and evaluation.
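
The audit described above can be sketched as a simple set difference. This is an illustrative helper, not tooling from the framework; the names Persona and missing_personas and the example team are assumptions.

```python
from enum import Enum
from typing import Dict, Set


class Persona(Enum):
    """The three personas named in the answer above."""
    DATA_SCIENTIST = "Data Scientist / ML Engineer"
    ENGINEER = "Product / Application / Systems Engineer"
    DOMAIN_EXPERT = "Non-Technical Domain Expert or SME"


def missing_personas(team: Dict[str, Persona]) -> Set[Persona]:
    """Return every persona that no team member is mapped to."""
    return set(Persona) - set(team.values())


# Hypothetical team staffed only by ML engineers (the common warning sign).
team = {"Asha": Persona.DATA_SCIENTIST, "Ben": Persona.DATA_SCIENTIST}
gaps = missing_personas(team)
for persona in sorted(gaps, key=lambda p: p.value):
    print(f"Critical gap: no {persona.value}")
```

An empty result means all three personas are present; it says nothing about whether each one contributes at the right stage, which still needs a human check.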

Question 6

How do I set up a human annotation workflow for AI agents?

Accepted Answer

Recruit domain experts who understand the problem the agent solves. Have them review agent traces — the logged execution steps and outputs — and label each trace as good or poor performance. Critically, require them to explain why. This generates grounded evaluation signal. Feed this labelled data back into your eval pipeline to validate LLM-as-judge assessments. Do not treat human annotation as cosmetic; it is the foundation of trustworthy evals.
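
One way to enforce the "explain why" requirement is to make the rationale mandatory in the annotation record itself. The sketch below assumes a two-value label ("good" or "poor") as described above; the Annotation class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A domain expert's judgement on one trace (illustrative schema)."""
    trace_id: str
    annotator: str
    label: str      # "good" or "poor"
    rationale: str  # the expert's explanation of why

    def __post_init__(self) -> None:
        # Reject labels outside the agreed vocabulary.
        if self.label not in ("good", "poor"):
            raise ValueError("label must be 'good' or 'poor'")
        # Reject annotations that skip the explanation.
        if not self.rationale.strip():
            raise ValueError("a rationale is required for every label")
```

With this shape, an annotation without an explanation cannot enter the eval dataset at all, rather than being caught later in review.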

Question 7

How do I validate LLM-as-judge evaluations?

Accepted Answer

Use human-labelled ground truth datasets to check whether your LLM-as-judge assessments agree with human labels, and whether that agreement holds or drifts over time. Data scientists should calculate precision, recall, and F1 specifically for the judge's alignment with human labels, not for the agent itself. If the judge drifts from human agreement, adjust the judge's prompt or swap the underlying model. Never let LLM-as-judge evals run unchecked.
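
A minimal sketch of the alignment calculation, treating the human label as ground truth and the judge's label as the prediction. The function name judge_alignment, the choice of "good" as the positive class, and the sample labels are assumptions made for illustration.

```python
from typing import Dict, List


def judge_alignment(
    human: List[str], judge: List[str], positive: str = "good"
) -> Dict[str, float]:
    """Precision, recall and F1 of judge labels measured against human labels."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")
    pairs = list(zip(human, judge))
    tp = sum(h == positive and j == positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)
    fn = sum(h == positive and j != positive for h, j in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for the same five traces.
human_labels = ["good", "good", "poor", "poor", "good"]
judge_labels = ["good", "poor", "good", "poor", "good"]
scores = judge_alignment(human_labels, judge_labels)
print(scores)  # precision, recall and F1 are each 2/3 for this sample
```

Low precision here means the judge passes traces that humans rejected; low recall means it rejects traces that humans accepted. Which one matters more depends on the use case.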

Question 8

My agent works great in testing but fails in production — what went wrong?

Accepted Answer

Confidence built during experimentation does not automatically transfer to production. Real users confront the agent with scenarios your evals did not anticipate. The Hetzel framework requires an observability pipeline that monitors agent behaviour post-deployment and feeds production data back into the offline evaluation dataset. If you skipped observability, you are flying blind. Build the feedback loop between production behaviour and experimentation immediately.

Question 9

Our ML team built the agent but users say it doesn't actually solve their problem — why?

Accepted Answer

This is the classic Traditional Enterprise failure mode. Your ML team likely optimised for technical metrics — precision, recall, F1 — without evaluating functional performance: does the agent actually accomplish its intended purpose for real users? The Hetzel framework prescribes bringing domain experts into prompt/context engineering and human annotation to close this gap. The people closest to the problem must shape the agent's behaviour, not just the engineers.

Question 10

Our LLM-as-judge scores are high but users are unhappy — what's happening?

Accepted Answer

Your LLM-as-judge has likely drifted from human agreement. Judges are just prompts and models — they can diverge from what real humans consider good performance. Validate judge outputs against human-labelled ground truth using precision, recall, and F1 for judge-human alignment. If divergence is detected, revise the judge prompt, switch the judge model, or expand your human annotation dataset. This validation loop is a core responsibility of data scientists in the Hetzel framework.
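
Drift can be made visible by tracking judge-human agreement per time window. The sketch below is one simple way to do that; the window structure, the function names, and the 0.8 threshold are assumptions, not values prescribed by the framework.

```python
from typing import List, Tuple

LabelPair = Tuple[str, str]  # (human_label, judge_label) for one trace


def agreement_rate(pairs: List[LabelPair]) -> float:
    """Fraction of traces where the judge's label matches the human label."""
    if not pairs:
        raise ValueError("need at least one labelled pair")
    return sum(h == j for h, j in pairs) / len(pairs)


def drifted_windows(
    windows: List[List[LabelPair]], min_agreement: float = 0.8
) -> List[int]:
    """Indices of time windows whose agreement falls below the threshold."""
    return [
        i for i, pairs in enumerate(windows)
        if agreement_rate(pairs) < min_agreement
    ]


# Hypothetical weekly samples of human-annotated traces.
week_0 = [("good", "good"), ("poor", "poor"), ("good", "good"), ("poor", "poor")]
week_1 = [("poor", "good"), ("poor", "good"), ("good", "good"), ("poor", "poor")]
flagged = drifted_windows([week_0, week_1])
```

In this sample the judge starts passing traces that humans marked poor in the second week, which matches the symptom in the question: high judge scores alongside unhappy users.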

Question 11

How does the Hetzel framework compare to standard MLOps team structures?

Accepted Answer

Standard MLOps teams are built around model training, feature engineering, and deployment pipelines. The Hetzel framework argues this structure is wrong for agents because the model is already built. Instead of data pipeline specialists, you need context engineers (domain experts), LLM-as-API integration specialists (product engineers), and eval/guardrail validators (data scientists in a redirected role). The team shape is fundamentally different because the work is fundamentally different.

Question 12

How is the Hetzel framework different from just hiring full-stack AI engineers?

Accepted Answer

Full-stack AI engineers can cover technical breadth but typically lack deep domain expertise — the 'proximity to the problem' that the Hetzel framework identifies as disproportionately valuable. A team of full-stack engineers without domain experts will build technically sound agents that miss contextual grounding. The framework explicitly requires non-technical domain experts as co-equals, not as consultants. The answer is always a deliberate mix, not a single generalist role.

Question 13

Does the Hetzel framework apply to multi-agent systems with supervisor and sub-agents?

Accepted Answer

Yes, and multi-agent architectures increase the importance of product/application/systems engineers. Distributed agent systems — where a supervisor orchestrates sub-agents across different infrastructure and compute — are fundamentally a systems engineering problem. The Hetzel framework assigns product engineers responsibility for managing this infrastructure complexity, building observability across agent boundaries, and ensuring the eval pipeline covers the full multi-agent interaction, not just individual agent outputs.

Question 14

When should I actually fine-tune a model instead of using context engineering?

Accepted Answer

Fine-tune only when context engineering has been exhausted and the use case genuinely requires specialised model behaviour that prompting cannot achieve — for example, domain-specific reasoning patterns, highly specialised output formats, or latency-sensitive deployments where smaller fine-tuned models outperform large general-purpose ones. The Hetzel framework treats fine-tuning as rare and assigns it exclusively to data scientists or ML engineers. Most teams should invest in better prompts and context before considering fine-tuning.

Question 15

How do I handle the politics of telling ML engineers they shouldn't lead agent development?

Accepted Answer

The Hetzel framework does not sideline ML engineers — it redirects them to their highest-value contribution: guardrails, statistical literacy, eval validation, and fine-tuning when genuinely needed. Frame it as role clarification, not demotion. ML engineers are the 'adults in the room' on LLM risk. The key message is that agent development requires a different team shape because the work is fundamentally different from traditional ML — the model is already built.

Question 16

Can I use the Hetzel framework for a team of one?

Accepted Answer

Yes, but you must wear all three hats consciously. Use the framework as a checklist: Am I validating eval quality with statistical rigour (data scientist hat)? Am I building reliable infrastructure and observability (engineer hat)? Am I deeply connected to the problem this agent solves (domain expert hat)? Solo builders most commonly skip the domain expert and eval validation roles. The framework helps you identify which hat you're neglecting before it becomes a production problem.

Question 17

What role should a product manager play in an agentic AI team?

Accepted Answer

Product managers bridge the domain expert and engineering personas. In the Hetzel framework, PMs help define functional performance criteria — what 'good' looks like from the user's perspective — and ensure domain experts have meaningful control over context engineering decisions. PMs also pressure-test whether the team has sufficient proximity to the problem. They should not be the sole voice of the user; actual domain experts and SMEs must be involved directly.

Question 18

What is functional performance and how is it different from precision and recall?

Accepted Answer

Functional performance evaluates whether an agent actually accomplishes its intended purpose for real users — did it resolve the customer's query, produce a correct legal summary, or complete the workflow correctly? Precision, recall, and F1 are technical metrics suited to narrow ML classification tasks. Agents operate across a much broader surface area, and functional performance captures user-facing quality that technical metrics miss. The Hetzel framework prioritises functional performance as the primary eval signal.
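
To make the contrast concrete, a functional check asks whether each output did its job rather than how well a classifier performed. The sketch below is one possible shape, assuming domain experts supply the checks; the function name, check names, and sample outputs are all hypothetical.

```python
from typing import Callable, Dict, List

Check = Callable[[str], bool]


def functional_pass_rate(outputs: List[str], checks: Dict[str, Check]) -> float:
    """Share of agent outputs that pass every functional check."""
    if not outputs:
        raise ValueError("need at least one output to evaluate")
    passed = sum(all(check(out) for check in checks.values()) for out in outputs)
    return passed / len(outputs)


# Illustrative checks a support-domain expert might define.
checks: Dict[str, Check] = {
    "states_refund_window": lambda out: "30 days" in out,
    "no_placeholder_text": lambda out: "TODO" not in out,
}
outputs = [
    "Refunds are accepted within 30 days.",
    "Refunds: TODO",
    "Please contact support.",
]
rate = functional_pass_rate(outputs, checks)
```

Real functional criteria are rarely reducible to string checks; in practice the pass or fail decision would often come from human annotation or a validated judge, with the same pass-rate calculation on top.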

Question 19

How often should I update my agent eval dataset with production data?

Accepted Answer

Continuously. The Hetzel framework prescribes using production data — captured through the observability pipeline — to expand the offline evaluation dataset on an ongoing basis. Real usage surfaces edge cases and failure modes that pre-production evals cannot anticipate. Set up an automated or semi-automated pipeline that flags interesting production traces for human review and annotation by domain experts, then fold those labelled examples into your eval suite.
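
The flagging step might look like the following sketch. The trace fields, the 0.0 to 1.0 judge score scale, and the 0.5 cut-off are assumptions chosen for illustration; a real pipeline would use whatever signals its observability stack records.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProductionTrace:
    """Signals recorded for one production run (illustrative fields)."""
    trace_id: str
    judge_score: float      # assumed 0.0-1.0 score from an LLM-as-judge
    user_thumbs_down: bool  # explicit negative feedback from the user
    error: bool             # the run raised an error or timed out


def flag_for_review(
    traces: List[ProductionTrace], low_score: float = 0.5
) -> List[str]:
    """IDs of traces worth sending to domain experts for annotation."""
    return [
        t.trace_id
        for t in traces
        if t.error or t.user_thumbs_down or t.judge_score < low_score
    ]


# Hypothetical batch pulled from the observability pipeline.
batch = [
    ProductionTrace("a", 0.9, False, False),
    ProductionTrace("b", 0.3, False, False),
    ProductionTrace("c", 0.8, True, False),
    ProductionTrace("d", 0.95, False, True),
]
review_queue = flag_for_review(batch)
```

Trace "c" is the interesting case: the judge scored it well but the user disagreed, which makes it a good candidate for both the eval dataset and judge validation.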

Question 20

What happens if I skip domain experts and just use engineers for prompt engineering?

Accepted Answer

Engineers without domain expertise will write prompts that are technically correct but contextually shallow. The agent may produce fluent outputs that miss critical domain nuances — regulatory requirements, industry-specific terminology, or user expectations that only someone close to the problem would catch. The Hetzel framework's 'Proximity to the Problem' principle exists specifically to prevent this: the people who understand what the agent is meant to solve must shape its prompts and context.

Question 21

Is the Hetzel framework only for LLM-based agents or does it apply to other AI systems?

Accepted Answer

The framework is specifically designed for agentic AI systems built on pre-trained LLMs — where the model is already built and the team's job is implementation, evaluation, and contextualisation. For traditional ML systems that require custom model training, feature engineering, and data pipeline work, standard MLOps team structures may be more appropriate. The key diagnostic is whether your team is building a model or implementing one.

Frequently Asked Questions About Hetzel Agent Team Composition Framework

// Basics