Frequently Asked Questions About the Hetzel Agent Team Composition Framework

22 answers covering everything from basics to advanced usage.

// Basics

What is the difference between a Traditional Enterprise and an AI Native in the Hetzel framework?

A Traditional Enterprise approaches agentic development by delegating it to an existing ML or data science platform team, typically because generative AI is categorised as an 'AI problem.' An AI Native built its entire offering around agents from the start, with small, cross-functional, agile teams and no legacy ML platform. The classification matters because each carries a default risk: Traditional Enterprises over-index on ML metrics and under-include domain experts, while AI Natives risk under-engineering rigour and guardrails.

What is proximity to the problem and why does it matter for AI agents?

Proximity to the problem is the degree to which a team member understands what the agent is actually meant to solve. Domain experts and subject matter experts have the highest proximity. In the Hetzel framework, proximity determines who should lead context engineering and human annotation — the two activities with the greatest influence on agent quality. Teams with no one who deeply understands the end-user problem will build agents that are technically functional but contextually irrelevant.

What is the difference between evals and observability for AI agents?

Evals are evaluations performed during experimentation and development to build confidence in an agent before production deployment. Observability is monitoring the agent's behaviour after it goes live to maintain confidence as it encounters real users and unpredictable inputs. Together they form the two pillars of agent quality in the Hetzel framework. Confidence built in experimentation does not automatically transfer to production — real usage exposes scenarios that evals did not anticipate, which is why both are essential.

What is an agent trace and who should review it?

An agent trace is a logged record of an agent's execution steps, decisions, and outputs. In the Hetzel framework, agent traces are reviewed by both technical and non-technical evaluators. Domain experts review traces during human annotation workflows to label whether the agent performed well or poorly and explain why. Engineers use traces to debug execution failures and feed insights back into the eval pipeline. Traces are the raw material for both evaluation and observability.
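
As a rough illustration of what such a record might hold, here is a minimal sketch. The framework does not prescribe a schema, so every class and field name below (TraceStep, Annotation, AgentTrace, performed_well, and so on) is an assumption made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TraceStep:
    """One execution step: what the agent was given and what it produced."""
    name: str    # hypothetical step name, e.g. "retrieve_policy"
    input: str
    output: str


@dataclass
class Annotation:
    """A domain expert's verdict on a trace, including the reasoning."""
    reviewer: str
    performed_well: bool
    reason: str  # the "why" behind the label


@dataclass
class AgentTrace:
    """Logged record of one agent run, optionally annotated by a reviewer."""
    trace_id: str
    steps: List[TraceStep] = field(default_factory=list)
    final_output: str = ""
    annotation: Optional[Annotation] = None

    def is_labelled(self) -> bool:
        # A trace counts as labelled once a reviewer has attached a verdict.
        return self.annotation is not None
```

Keeping the annotation on the same record as the steps lets engineers and domain experts work from one artefact: engineers read the steps to debug, experts attach the label and reason.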

// How To

How do I audit my current AI team using the Hetzel framework?

Map every current team member against three personas: (1) Data Scientists / ML Engineers, (2) Product / Application / Systems Engineers, and (3) Non-Technical Domain Experts or Subject Matter Experts. Identify which personas are missing or marginalised. A team staffed entirely by ML engineers is a warning sign. Then check whether anyone on the team deeply understands what the end agent is meant to solve — if not, you have a critical proximity-to-the-problem gap that must be fixed before building.
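
The mapping step can be sketched in a few lines. This is only an illustration: the persona keys and the audit_team helper are hypothetical names, and the proximity-to-the-problem check still needs human judgement that no script can supply.

```python
from collections import Counter

# Hypothetical keys for the three personas named above.
PERSONAS = ("data_science_ml", "product_systems_engineering", "domain_expert")


def audit_team(members):
    """Count team members per persona and list personas nobody fills.

    members: dict mapping a person's name to one of PERSONAS.
    Returns (counts_per_persona, missing_personas).
    """
    counts = Counter(members.values())
    unknown = set(counts) - set(PERSONAS)
    if unknown:
        raise ValueError(f"unknown persona(s): {sorted(unknown)}")
    missing = [p for p in PERSONAS if counts[p] == 0]
    return {p: counts[p] for p in PERSONAS}, missing
```

For a team made up only of ML engineers, the second return value lists both other personas as missing, which is the warning sign described above.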

How do I assign roles to data scientists on an agentic AI team?

Assign data scientists three specific roles: (1) act as the 'adult in the room' on LLM risk and statistical literacy, (2) validate LLM-as-judge evals against labelled datasets using precision, recall, and F1, and (3) lead fine-tuning of open-source models only if the use case genuinely requires it. Critically, redirect them away from treating traditional ML metrics as the primary eval signal — those metrics are suited to a two-box ML pipeline, not to the broad functional surface area of agentic behaviour.
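
Role (2) can be sketched as follows. The example assumes binary labels where True means "the agent performed well" and treats the human labels as ground truth; the function name judge_metrics is hypothetical.

```python
def judge_metrics(human_labels, judge_labels):
    """Precision, recall and F1 of the LLM judge against human labels.

    Both arguments are equal-length lists of booleans. The positive class
    is True ("agent performed well"); human labels are the ground truth.
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)        # judge and human both say good
    fp = sum(1 for h, j in pairs if not h and j)    # judge says good, human says poor
    fn = sum(1 for h, j in pairs if h and not j)    # judge says poor, human says good
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1
```

Here the metrics measure the judge, not the agent. That keeps precision, recall, and F1 in the place the framework assigns them: checking the evaluator rather than serving as the primary eval signal.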

How do I set up a human annotation workflow for AI agents?

Recruit domain experts who understand the problem the agent is solving. Give them access to agent traces — the logged records of the agent's execution steps and outputs. Ask them to review each trace and label whether the agent performed well or poorly, and critically, explain why. This reasoning is as important as the label itself. Structure this as a recurring process, not a one-time activity. Use the labelled data to validate LLM-as-judge evals and continuously expand your offline evaluation dataset.

How do I build an eval pipeline for agentic AI?

Start by defining functional performance criteria jointly across all three personas — what does 'good' actually look like for the end user? Then implement LLM-as-judge evaluations as automated assessments of agent outputs. Validate the judge against human-labelled ground truth datasets produced by domain experts. Monitor whether the judge's ratings converge with or diverge from human agreement over time. Use production data to continuously expand your offline eval dataset, closing the loop between observability and experimentation.

// Troubleshooting

My AI agent performs well in testing but fails in production — what's wrong?

This is the observability gap the Hetzel framework explicitly warns about. Confidence built in experimentation does not transfer automatically to production because real usage confronts the agent with scenarios your evals did not anticipate. Fix this by building an observability pipeline that monitors agent behaviour post-deployment, surfaces failure patterns, and feeds production data back into your offline eval dataset. Also check whether your eval criteria are too narrow — if you're only measuring precision and recall, you're missing functional performance failures.
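
The feedback step might look like the sketch below. It assumes production traces arrive as dictionaries with "input", "output", and "failed" keys, and that exact-duplicate inputs (after case and whitespace normalisation) should be stored only once. Those field names, the helper names, and the deduplication rule are all assumptions for illustration.

```python
import hashlib


def case_key(user_input):
    """Stable key for an input, ignoring case and extra whitespace."""
    normalised = " ".join(user_input.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def add_production_failures(eval_dataset, production_traces):
    """Copy failed production traces into the offline eval dataset.

    eval_dataset: dict mapping case_key -> case record (mutated in place).
    production_traces: iterable of dicts with "input", "output", "failed".
    Returns the number of new cases added.
    """
    added = 0
    for trace in production_traces:
        if not trace["failed"]:
            continue
        key = case_key(trace["input"])
        if key not in eval_dataset:
            eval_dataset[key] = {
                "input": trace["input"],
                "observed_output": trace["output"],
                "expected": None,  # left for a domain expert to fill in
            }
            added += 1
    return added
```

The expected field is left empty on purpose: a production failure shows what went wrong, while the correct behaviour still has to come from a domain expert's annotation.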

Our LLM-as-judge evals keep disagreeing with human reviewers — what should I do?

Divergence between LLM-as-judge evals and human reviewers means your automated judge is drifting from ground truth. First, validate the judge against a fresh set of human-labelled examples; the judge is just a prompt and a model, and it can be wrong. Then adjust the judge's prompt to better align with the criteria human annotators are using, and track convergence metrics over time. If disagreement persists, weight human labels more heavily and use the judge only as a screening tool, not a source of truth.
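
One simple way to track convergence is the raw agreement rate per review batch. The sketch below assumes that metric, plus an illustrative threshold of 0.8 over the last three batches; neither number comes from the framework, and the function names are hypothetical.

```python
def agreement_rate(human_labels, judge_labels):
    """Fraction of examples where the judge's label matches the human's."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(1 for h, j in zip(human_labels, judge_labels) if h == j)
    return matches / len(human_labels)


def judge_role(batches, threshold=0.8, window=3):
    """Decide how much to trust the judge from recent agreement.

    batches: time-ordered list of (human_labels, judge_labels) pairs.
    Returns "screening_only" when mean agreement over the last `window`
    batches falls below `threshold`, otherwise "automated_eval".
    """
    recent = batches[-window:]
    if not recent:
        raise ValueError("need at least one batch")
    rates = [agreement_rate(h, j) for h, j in recent]
    mean_rate = sum(rates) / len(rates)
    return "screening_only" if mean_rate < threshold else "automated_eval"
```

Raw agreement can look flattering when one label dominates, so a chance-corrected statistic may suit skewed datasets better. The appropriate threshold depends on how costly a wrong verdict is in your domain.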

We staffed our agent team with only engineers and the agent isn't working — why?

You're missing the proximity-to-the-problem principle. Engineers build infrastructure and integrations but typically lack deep understanding of what the end agent should actually solve. Without domain experts, your context engineering will be generic, your eval criteria will be technically focused but functionally shallow, and your human annotation will lack the reasoning that makes it valuable. Bring in subject matter experts immediately — give them ownership of prompt/context engineering and human annotation before continuing development.

What happens if I skip the organisation type classification step?

Skipping organisation type classification means you won't recognise your default risk pattern. Traditional Enterprises default to over-indexing on ML metrics and under-including domain experts — if you don't identify this pattern, you'll perpetuate it. AI Natives default to moving fast without rigour — if you don't identify this, you'll ship agents without proper evals or guardrails. The classification shapes which corrective actions you prioritise first, so skipping it means you're likely solving the wrong problem.

// Comparisons

How does the Hetzel framework compare to Google's ML team structure recommendations?

Google's ML team guidelines are designed for traditional machine learning where the team builds, trains, validates, and deploys models. The Hetzel framework starts from a fundamentally different premise: in agentic AI, the model is already built by providers like OpenAI or Anthropic. This eliminates training as a core team competency and elevates product engineering (treating LLMs as APIs) and domain expertise (context engineering replaces feature engineering). The Hetzel framework is purpose-built for the post-foundation-model era, not adapted from traditional ML staffing.

How is the Hetzel framework different from a standard cross-functional product team?

A standard cross-functional product team includes designers, engineers, and PMs but doesn't typically prescribe specific roles tied to ML evaluation, context engineering, or human annotation workflows. The Hetzel framework adds two critical dimensions: (1) it explicitly defines what data scientists should and should not do on agent projects (eval validation, not model training), and (2) it elevates domain experts from stakeholders to active builders who own prompt engineering and trace annotation. It's a product team structure specialised for the unique dynamics of agentic AI.

How does treating LLMs as APIs change who should build AI agents?

When you recognise that LLMs are APIs (send a payload, receive a response, make it useful), agent development becomes accessible to product and application engineers who are deeply experienced in exactly this integration pattern. This is a core Hetzel principle. It means you don't need a team of ML PhDs to build agents; you need engineers who can manage API integrations, build robust systems, and create feedback loops. ML expertise is still needed, but for eval validation and guardrails rather than for the core build.
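
The integration pattern can be sketched without committing to any provider. In the example below, send is a hypothetical transport function supplied by the caller (in practice it would wrap whichever provider SDK or HTTP client you use), and the payload shape and the "answer" field are assumptions for illustration, not a real API contract.

```python
import json
from typing import Callable


def ask_agent(send: Callable[[dict], str], question: str,
              max_attempts: int = 3) -> dict:
    """Send a payload, receive a response, and make it usable.

    send: hypothetical transport that takes a payload dict and returns
          the raw response text.
    Retries when the response is not valid JSON or lacks an "answer" field.
    """
    payload = {"prompt": question, "response_format": "json"}  # assumed shape
    last_error = None
    for _ in range(max_attempts):
        raw = send(payload)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            continue
        if isinstance(parsed, dict) and "answer" in parsed:
            return parsed
        last_error = ValueError("response missing 'answer' field")
    raise RuntimeError(
        f"no usable response after {max_attempts} attempts"
    ) from last_error
```

Nothing here requires ML training knowledge: it is request handling, validation, and retry logic, which is the kind of work product and application engineers already do.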

// Advanced

Can the Hetzel framework work for a single-person team building an AI agent?

The framework can inform a solo builder's priorities even if they can't staff three distinct personas. Focus on the principles: recognise the model is already built, invest in context engineering over feature engineering, evaluate functional performance not just technical metrics, and build observability from day one. Where you'll struggle is the proximity-to-the-problem gap — if you're building for a domain you don't deeply understand, finding even one domain expert advisor to review agent traces is critical.

How do I apply the Hetzel framework to a multi-agent system with supervisor and sub-agents?

Multi-agent architectures amplify the need for product/systems engineers because a supervisor orchestrating sub-agents on different infrastructure is fundamentally a distributed systems problem. The Hetzel framework assigns systems engineers ownership of this infrastructure layer. Domain experts remain essential — they evaluate whether the overall agent system accomplishes its purpose, not just whether individual sub-agents produce correct outputs. Data scientists validate that eval pipelines account for cascading failures across the agent chain.
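
One way an eval pipeline might account for cascading failures is to attribute each failed run to the first sub-agent that went wrong, rather than to every agent downstream of it. The sketch below assumes each run is recorded as an ordered list of (agent_name, succeeded) pairs; that representation and the function names are hypothetical.

```python
from collections import Counter


def first_failure(chain):
    """Return the name of the first failed agent in a run, or None.

    chain: ordered list of (agent_name, succeeded) pairs.
    """
    for name, succeeded in chain:
        if not succeeded:
            return name
    return None


def failure_origins(runs):
    """Count, across runs, which agent each failure originated from."""
    origins = Counter()
    for chain in runs:
        origin = first_failure(chain)
        if origin is not None:
            origins[origin] += 1
    return dict(origins)
```

Counting origins rather than all failed steps keeps a single upstream error from being reported as several independent sub-agent problems.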

When should I actually fine-tune a model instead of using context engineering?

Fine-tune only when context engineering has demonstrably hit its ceiling and the use case demands specialised behaviour that pre-built APIs cannot deliver — typically highly domain-specific language patterns, regulatory requirements, or performance constraints that require a smaller, self-hosted model. The Hetzel framework treats fine-tuning as the exception, not the default. If you're working with OpenAI or Anthropic APIs and your agent's behaviour can be improved by changing prompts and context, fine-tuning is premature optimisation.

How do I convince my ML team lead that domain experts should own prompt engineering?

Use the Hetzel framework's core argument: in agentic AI, the primary lever for changing agent behaviour is context engineering — adjusting prompts and inputs — not retraining or feature engineering. This shifts meaningful control toward people with deep domain knowledge. Show your ML lead that their highest-value contribution is eval validation and guardrails, not prompt writing. Frame it as a division of labour that maximises everyone's expertise rather than diminishing ML's role. The data scientist becomes the quality gatekeeper, which is a high-status position.

What does 'functional performance' mean for AI agents and how do I measure it?

Functional performance evaluates whether an agent actually accomplishes its intended purpose for real users — as distinct from technical metrics like precision, recall, and F1. To measure it, define success criteria from the user's perspective: Did the agent resolve the customer's query? Did it produce a correct legal summary? Did it complete the workflow end-to-end? Then build evals around these criteria using both LLM-as-judge automation and domain expert annotation. Functional performance is inherently broader and more context-dependent than technical metrics.
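
A minimal sketch of scoring user-facing criteria follows. It assumes each agent run has already been reviewed (by a judge, a domain expert, or both) and recorded as a dictionary of outcomes; the field names and the functional_pass_rates helper are hypothetical.

```python
def functional_pass_rates(results, criteria):
    """Share of runs that meet each user-facing success criterion.

    results: non-empty list of dicts describing reviewed agent runs.
    criteria: dict mapping a criterion name to a predicate over one result.
    """
    if not results:
        raise ValueError("need at least one result")
    return {
        name: sum(1 for r in results if check(r)) / len(results)
        for name, check in criteria.items()
    }


# Example criteria phrased from the user's perspective (assumed field names).
EXAMPLE_CRITERIA = {
    "query_resolved": lambda r: r["query_resolved"],
    "workflow_completed": lambda r: r["workflow_completed"],
}
```

Each criterion is reported separately because an agent can resolve most queries while still failing to complete workflows end-to-end, and a single blended score would hide that.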

How often should I update my agent evaluation dataset?

Continuously. The Hetzel framework emphasises using production data to expand the offline evaluation dataset on an ongoing basis. Every time your observability pipeline surfaces a new failure mode or edge case in production, add it to your eval dataset. Domain experts should regularly annotate new agent traces from live usage. The eval dataset is a living artifact, not a static benchmark — if it stops growing, your evals are becoming stale and your confidence in agent quality is degrading.

Is the Hetzel framework only for teams using LLM APIs or does it work for self-hosted models?

The framework applies to both API-based and self-hosted model deployments. Its core principles — proximity to the problem, functional performance over technical metrics, context engineering as the primary lever — hold regardless of where the model runs. If you're self-hosting open-source models, the data scientist role expands to include fine-tuning oversight, and the systems engineering role becomes more demanding. But the fundamental team composition insight — you need all three personas — doesn't change based on deployment model.