Hetzel Agent Team Composition Framework
Design the right cross-functional team mix to build production-ready agentic AI systems by applying Phil Hetzel's diagnostic for who should own, build, and evaluate agents in your organisation.
// TL;DR
The Hetzel Agent Team Composition Framework is a diagnostic for staffing agentic AI projects correctly. It identifies three essential personas — data scientists/ML engineers, product/application/systems engineers, and non-technical domain experts — and assigns each a specific role in building, evaluating, and contextualising production agents. Use it whenever your organisation is deciding who should own agentic AI development, especially when an existing ML team has been handed agent responsibility by default or when you're staffing a new AI initiative from scratch. The framework prevents the most common mistake: treating agent-building as a traditional ML problem and staffing it accordingly.
// When should you use the Hetzel Agent Team Composition Framework?
Use this skill whenever a team is deciding who should own agentic AI development — especially when an existing ML or data science team has been handed agent-building responsibility by default, or when a new AI initiative is being staffed from scratch.
// What inputs do you need before applying the Hetzel framework?
- Organisation Typerequired
Is the organisation a Traditional Enterprise (existing ML/DS teams, delegated mandate) or an AI Native (built around agents from the start)? - Agent Use Caserequired
What problem is the agent meant to solve? Who are the end users? - Current Team Compositionrequired
Who is currently assigned to build the agent — data scientists, ML engineers, product engineers, subject matter experts, PMs, or some mix? - Agent Architecture Complexity
Is this a single agent or a distributed multi-agent system (supervisor + sub-agents on different infrastructure)? - Fine-Tuning Requirement
Does the use case require fine-tuning an open-source model, or is the team working entirely with pre-built LLM APIs?
// What are the core principles behind the Hetzel Agent Team Composition Framework?
The Model Is Already Built
Unlike traditional ML, the data pipeline of training and testing has already been done by Anthropic, OpenAI, Mistral, et al. The team's job is not to build a model — it is to implement, evaluate, and contextualise one. This fundamentally changes which skills are most valuable.
Proximity to the Problem
The people closest to the problem the agent is meant to solve — domain experts, subject matter experts, product managers — hold disproportionate value in agentic development. Proximity to the problem determines quality of context engineering and human annotation.
Functional Performance Over Technical Metrics
Evaluating agents requires assessing functional performance across a far broader surface area than traditional ML. Locking on to precision, recall, and F1 alone is a trap — those are technical metrics for a two-box ML pipeline, not for agentic behaviour.
LLMs Are Just APIs
From a product engineering perspective, LLMs are APIs: send a payload, receive a response, make it useful to the end user. This reframe opens agent development to product and application engineers who are deeply experienced in exactly this pattern.
Context Engineering, Not Feature Engineering
The primary lever for changing agent behaviour is changing the inputs — prompts and context — rather than retraining or feature engineering. This shifts meaningful creative control toward those with deep domain knowledge, not just technical depth.
The Answer Is Always in the Middle
No single discipline owns agents. The ideal team is deliberately diverse: data scientists, product/application/systems engineers, and non-technical domain experts each contribute irreplaceable value at different stages of building and evaluating agents.
// How do you apply the Hetzel framework step by step?
- 1
Classify the organisation type
Determine whether this is a Traditional Enterprise (ML/DS team handed the agent mandate top-down) or an AI Native (small, cross-functional, agile team built around agents). The classification shapes the default risk: Traditional Enterprise teams tend to over-index on ML metrics and under-include non-technical experts; AI Natives risk under-engineering rigour and guardrails.
- 2
Audit the current team composition for coverage gaps
Map current team members against three personas: (1) Data Scientists / ML Engineers, (2) Product / Application / Systems Engineers, (3) Non-Technical Domain Experts or Subject Matter Experts. Identify which personas are missing or marginalised. A team staffed entirely by ML engineers is a warning sign.
- 3
Assign Data Scientists / ML Engineers their agent-specific role
Their role is NOT to train the model — that is already done. Assign them to: (a) act as the 'adult in the room' on LLM risk and statistical literacy, (b) validate LLM-as-judge evals against labelled datasets using recall, precision, and F1, and (c) lead fine-tuning of open-source models if the use case genuinely requires it. Redirect them away from obsessing over traditional ML metrics as the primary eval signal.
- 4
Assign Product / Application / Systems Engineers their agent-specific role
These engineers implement requirements into the product, manage the systems and infrastructure where agents execute (especially critical for distributed multi-agent architectures with supervisor and sub-agents on different compute), and build the eval and observability pipelines that close the feedback loop between production and experimentation.
- 5
Assign Non-Technical Domain Experts their agent-specific role
These people have the most proximity to the problem. Give them meaningful control over prompt and context engineering — the primary lever for changing agent behaviour. Also deploy them in human annotation workflows: they should review agent traces and label whether the agent performed well or poorly, and critically, explain WHY. Do not treat this as optional or cosmetic.
- 6
Define the eval and observability pipeline jointly
Evals (pre-production experimentation) and observability (post-production monitoring) are the two pillars of agent quality. The team must agree on what 'good' looks like functionally — not just technically. Use production data to continuously expand the offline evaluation dataset. Check whether LLM-as-judge evals are converging toward or diverging from human agreement over time.
- 7
Pressure-test the team against the 'Proximity to the Problem' principle
Ask: does the team have at least one person who deeply understands what the end agent is actually meant to solve? If the team is entirely engineers with no domain expert involved, the agent will lack the contextual grounding needed to be relevant. Fix this before building, not after.
// What does the Hetzel framework look like in real-world scenarios?
A large financial services firm assigns its existing ML platform team to build a customer-facing agent that answers account queries, because 'it has AI in the name'.
Classify as Traditional Enterprise. Audit the team — likely heavy on ML engineers, missing product engineers and domain experts (customer service specialists, compliance officers). Redirect ML engineers to guardrail and eval validation roles. Bring in product engineers to manage the LLM-as-API integration and systems infrastructure. Recruit customer service SMEs for prompt/context engineering and human annotation of agent traces. Redefine eval criteria beyond precision/recall to include functional performance — does the agent actually resolve the customer's query correctly and safely?
An AI-native startup building a legal research agent has a small team of generalist engineers who are moving fast but have no formal eval process.
Classify as AI Native. The proximity-to-the-problem advantage is present — engineers are close to the use case. The gap is rigour. Add a data scientist or someone with a stats background to build guardrails and design LLM-as-judge eval pipelines validated against labelled data. Formalise human annotation by involving a legal domain expert who reviews agent traces and labels correctness with reasoning. Build an observability pipeline so production behaviour feeds back into the offline eval dataset continuously.
// What mistakes should you avoid when staffing an agentic AI team?
- Handing agentic development entirely to ML or data science teams because 'it has AI in the name' — this is the most common Traditional Enterprise mistake and leads to teams optimising for the wrong metrics.
- Obsessing over precision, recall, and F1 as the primary eval signals for agents — these are technical metrics suited to a two-box ML pipeline, not to the broad functional surface area of agentic behaviour.
- Ignoring non-technical domain experts or treating their input as cosmetic — these people hold the most proximity to the problem and are the primary contributors to prompt/context engineering and human annotation quality.
- Treating fine-tuning as the default approach — fine-tuning open-source models is rare and should only be pursued when the use case genuinely demands it; most agent behaviour is changed via context engineering, not retraining.
- Skipping the observability pipeline post-production — confidence in an agent built in experimentation does not transfer automatically to production; real usage confronts the agent with scenarios that evals did not anticipate.
- Letting LLM-as-judge evals run unchecked without validating them against human-labelled ground truth — judges are just prompts and models; they can drift from human agreement without a self-check mechanism.
- Treating agents as just another predictive model and applying the traditional cross-validation and AB-testing dance — the pipeline is entirely different once the model is already built.
// What are the key terms in the Hetzel Agent Team Composition Framework?
- Agent Quality
- The discipline of ensuring an agent performs correctly both before and after production, comprising two pillars: evals and observability.
- Evals
- Evaluations performed during experimentation and development to build confidence in an agent's execution before it is pushed to production.
- Agent Observability
- Monitoring an agent's behaviour after it is in production to maintain confidence in its execution as it encounters real users and real usage.
- Proximity to the Problem
- The degree to which a team member understands what the end agent is actually meant to solve. Higher proximity — typically found in domain experts and SMEs — leads to better context engineering and annotation quality.
- Context Engineering
- The primary lever for changing agent behaviour: adjusting the prompts, context, and inputs fed to a pre-built LLM rather than retraining or feature engineering.
- LLM-as-Judge
- Using a language model to evaluate the outputs of an agent as part of the eval process. Requires validation against labelled datasets to ensure the judge itself is trustworthy.
- Human Annotation Workflow
- A structured process in which domain experts review agent traces and label whether the agent performed well or poorly — and explain why — to generate grounded training and evaluation signal.
- Agent Trace
- A logged record of an agent's execution steps, decisions, and outputs that can be reviewed by technical or non-technical evaluators.
- Traditional Enterprise
- An organisation that approaches agentic development by delegating it to an existing ML or data science platform team, typically because generative AI is categorised as an 'AI problem'.
- AI Native
- An organisation that built its entire offering around agents from the start, typically characterised by small, cross-functional, agile teams with high proximity to the problem and no legacy ML platform.
- Functional Performance
- Evaluation of whether an agent actually accomplishes its intended purpose for real users — as distinct from technical performance metrics like precision, recall, and F1.
- Distributed Agent / Supervisor + Sub-Agents
- A multi-agent architecture in which a supervisor agent orchestrates multiple child or sub-agents running on different infrastructure, calling different systems — a complex systems engineering problem.
// FREQUENTLY ASKED QUESTIONS
What is the Hetzel Agent Team Composition Framework?
The Hetzel Agent Team Composition Framework is a diagnostic for designing cross-functional teams that build production-ready agentic AI systems. It classifies organisations as Traditional Enterprise or AI Native, audits current team composition against three essential personas (data scientists, product engineers, and domain experts), and assigns each persona a specific role in agent development. The framework is based on Phil Hetzel's insight that agents are fundamentally different from traditional ML — the model is already built, so the team's job is to implement, evaluate, and contextualise it.
What is context engineering in agentic AI?
Context engineering is the primary lever for changing agent behaviour: adjusting the prompts, context, and inputs fed to a pre-built LLM rather than retraining or feature engineering. In the Hetzel framework, context engineering is owned primarily by non-technical domain experts because they have the deepest proximity to the problem the agent is meant to solve. This shifts meaningful creative control toward those with domain knowledge rather than exclusively technical depth, which is a fundamental departure from traditional ML workflows.
How do I decide who should build AI agents at my company?
Start by classifying your organisation as a Traditional Enterprise or AI Native, then audit your current team against three personas: data scientists/ML engineers, product/application/systems engineers, and non-technical domain experts. If any persona is missing, you have a critical gap. Assign data scientists to eval validation and guardrails, product engineers to LLM-as-API integration and infrastructure, and domain experts to prompt/context engineering and human annotation. Never let a single discipline own the entire agent lifecycle.
How do you evaluate agentic AI systems differently from traditional ML models?
Evaluate agents on functional performance — whether the agent actually accomplishes its intended purpose for real users — not just precision, recall, and F1. Those technical metrics suit a two-box ML pipeline, not the broad behavioural surface area of agents. Build two pillars: evals (pre-production experimentation) and observability (post-production monitoring). Use LLM-as-judge pipelines validated against human-labelled ground truth, and continuously expand your offline eval dataset with production data.
How does the Hetzel framework compare to traditional ML team structures?
Traditional ML team structures centre data scientists around training, testing, and deploying models. The Hetzel framework recognises that in agentic AI, the model is already built by providers like OpenAI or Anthropic. This eliminates training as a core competency and elevates product engineering (LLMs are APIs) and domain expertise (context engineering replaces feature engineering). The result is a deliberately cross-functional team rather than an ML-centric one, with domain experts holding disproportionate value rather than being peripheral stakeholders.
When should I use the Hetzel Agent Team Composition Framework?
Use it whenever a team is deciding who should own agentic AI development. The most critical moment is when an existing ML or data science team has been handed agent-building responsibility by default — typically because executives categorise it as an 'AI problem.' Also use it when staffing a new AI initiative from scratch, when expanding a single-agent project to a multi-agent architecture, or when agent quality issues in production suggest the team lacks the right composition.
What role do domain experts play in building AI agents?
Domain experts hold disproportionate value in agentic development because they have the most proximity to the problem the agent is meant to solve. The Hetzel framework assigns them two critical responsibilities: leading prompt and context engineering (the primary lever for changing agent behaviour), and performing human annotation of agent traces where they label whether the agent performed well or poorly and explain why. Their input is not optional or cosmetic — it is foundational to agent quality.
What results can I expect from applying the Hetzel framework to my AI team?
Expect a team that optimises for functional performance rather than narrow technical metrics, catches agent failures across a broader surface area, and iterates faster because the right people own the right levers. Traditional enterprises will stop over-indexing on ML metrics and start including domain expertise structurally. AI natives will gain the rigour and guardrails they typically lack. Both will see improved eval coverage, stronger observability pipelines, and agents that actually solve the problem they were designed for.
Why shouldn't I let my ML team build AI agents on their own?
Handing agentic development entirely to ML teams is the most common Traditional Enterprise mistake. ML engineers default to what they know — training pipelines, precision/recall, cross-validation — but in agentic AI, the model is already built. The team's job is to implement, evaluate, and contextualise it. ML engineers alone will optimise for the wrong metrics, miss the importance of context engineering, and lack the domain proximity needed for meaningful eval and annotation. You need product engineers and domain experts alongside them.
Do I need to fine-tune a model to build an AI agent?
No — fine-tuning is rare and should only be pursued when the use case genuinely demands it. Most agent behaviour is changed via context engineering (adjusting prompts and inputs), not retraining. The Hetzel framework explicitly warns against treating fine-tuning as the default approach. If your team is working with pre-built LLM APIs from providers like OpenAI or Anthropic, fine-tuning is almost certainly unnecessary. Reserve it for cases where an open-source model must be adapted to a highly specialised domain.
Turn Any YouTube Video Into An AI Skill
SkillForge captures a creator's exact methodology from their video and turns it into a reusable AI skill you can invoke in Claude, ChatGPT, or any LLM.
Forge your own skill