Hetzel Agent Team Composition Framework

Last updated: 25 May 2026

Design the right cross-functional team mix to build production-ready agentic AI systems by applying Phil Hetzel's diagnostic for who should own, build, and evaluate agents in your organisation.

// TL;DR

The Hetzel Agent Team Composition Framework is a diagnostic for deciding who should own, build, and evaluate production agentic AI systems. Based on Phil Hetzel's analysis, it maps three essential personas — data scientists, product/systems engineers, and non-technical domain experts — to their highest-value roles in agent development. Use it whenever your organisation is staffing an agentic AI initiative, especially when an existing ML team has been handed the mandate by default or when you're building a cross-functional agent team from scratch. It prevents the common mistake of treating agents like traditional ML projects.


// When should you use the Hetzel Agent Team Composition Framework?

Use this skill whenever a team is deciding who should own agentic AI development — especially when an existing ML or data science team has been handed agent-building responsibility by default, or when a new AI initiative is being staffed from scratch.

// What inputs do you need before applying the Hetzel framework?

Organisation Type (required)
Is the organisation a Traditional Enterprise (existing ML/DS teams, delegated mandate) or an AI Native (built around agents from the start)?
Agent Use Case (required)
What problem is the agent meant to solve? Who are the end users?
Current Team Composition (required)
Who is currently assigned to build the agent: data scientists, ML engineers, product engineers, subject matter experts, PMs, or some mix?
Agent Architecture Complexity
Is this a single agent or a distributed multi-agent system (supervisor + sub-agents on different infrastructure)?
Fine-Tuning Requirement
Does the use case require fine-tuning an open-source model, or is the team working entirely with pre-built LLM APIs?

// What are the core principles behind the Hetzel Agent Team Composition Framework?

The Model Is Already Built

Unlike traditional ML, the training and testing pipeline has already been run by Anthropic, OpenAI, Mistral, and others. The team's job is not to build a model; it is to implement, evaluate, and contextualise one. This fundamentally changes which skills are most valuable.

Proximity to the Problem

The people closest to the problem the agent is meant to solve — domain experts, subject matter experts, product managers — hold disproportionate value in agentic development. Proximity to the problem determines quality of context engineering and human annotation.

Functional Performance Over Technical Metrics

Evaluating agents requires assessing functional performance across a far broader surface area than traditional ML. Locking on to precision, recall, and F1 alone is a trap — those are technical metrics for a two-box ML pipeline, not for agentic behaviour.

LLMs Are Just APIs

From a product engineering perspective, LLMs are APIs: send a payload, receive a response, make it useful to the end user. This reframe opens agent development to product and application engineers who are deeply experienced in exactly this pattern.
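
A minimal sketch of this reframe, assuming a hypothetical `LLMClient` callable that stands in for whichever vendor SDK or HTTP endpoint a team uses. None of the names below come from the framework or from a real library; the stub client exists only so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function that takes a request payload and returns raw text.
# In a real system this would wrap a vendor SDK or HTTP call (assumption).
LLMClient = Callable[[dict], str]


@dataclass
class AccountAnswer:
    """What the product layer hands to the end user."""
    text: str
    needs_human_review: bool


def answer_account_query(client: LLMClient, question: str) -> AccountAnswer:
    """Send a payload, receive a response, make it useful to the end user."""
    payload = {
        "system": "You answer customer account queries concisely.",
        "user": question,
    }
    raw = client(payload).strip()
    # Treat an empty response like any other API failure.
    if not raw:
        return AccountAnswer(text="Sorry, please try again.", needs_human_review=True)
    return AccountAnswer(text=raw, needs_human_review=False)


def stub_client(payload: dict) -> str:
    """Offline stand-in for a real LLM endpoint (illustrative only)."""
    return f"Echo: {payload['user']}"


result = answer_account_query(stub_client, "What is my balance?")
print(result.text)
```

Passing the client in as a parameter keeps the integration testable without network access, which is the same pattern product engineers already use for other third-party APIs.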

Context Engineering, Not Feature Engineering

The primary lever for changing agent behaviour is changing the inputs — prompts and context — rather than retraining or feature engineering. This shifts meaningful creative control toward those with deep domain knowledge, not just technical depth.
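
As a rough illustration of that lever, the sketch below assembles a model input from parts that a domain expert can edit directly. The function name `build_prompt`, the sample instruction, and the sample policy snippet are all hypothetical; the point is only that behaviour changes through the input, while the model itself is untouched.

```python
def build_prompt(instructions: str, context_snippets: list[str], question: str) -> str:
    """Assemble the model input from editable instructions and context."""
    parts = [instructions]
    if context_snippets:
        bullet_list = "\n".join(f"- {snippet}" for snippet in context_snippets)
        parts.append(f"Relevant context:\n{bullet_list}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


question = "Can I reverse a transfer?"

# Before: no domain knowledge supplied.
baseline = build_prompt("Answer customer account queries.", [], question)

# After: a domain expert adds a (made-up) policy snippet. No retraining involved.
revised = build_prompt(
    "Answer customer account queries.",
    ["Transfers can be reversed within 24 hours if still pending."],
    question,
)

print(revised)
```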

The Answer Is Always in the Middle

No single discipline owns agents. The ideal team is deliberately diverse: data scientists, product/application/systems engineers, and non-technical domain experts each contribute irreplaceable value at different stages of building and evaluating agents.

// How do you apply the Hetzel Agent Team Composition Framework step by step?

1. Classify the organisation type
Determine whether this is a Traditional Enterprise (ML/DS team handed the agent mandate top-down) or an AI Native (small, cross-functional, agile team built around agents). The classification shapes the default risk: Traditional Enterprise teams tend to over-index on ML metrics and under-include non-technical experts; AI Natives risk under-engineering rigour and guardrails.
2. Audit the current team composition for coverage gaps
Map current team members against three personas: (1) Data Scientists / ML Engineers, (2) Product / Application / Systems Engineers, (3) Non-Technical Domain Experts or Subject Matter Experts. Identify which personas are missing or marginalised. A team staffed entirely by ML engineers is a warning sign.
3. Assign Data Scientists / ML Engineers their agent-specific role
Their role is NOT to train the model; that is already done. Assign them to: (a) act as the 'adult in the room' on LLM risk and statistical literacy, (b) validate LLM-as-judge evals against labelled datasets using recall, precision, and F1, and (c) lead fine-tuning of open-source models if the use case genuinely requires it. Redirect them away from obsessing over traditional ML metrics as the primary eval signal.
4. Assign Product / Application / Systems Engineers their agent-specific role
These engineers implement requirements into the product, manage the systems and infrastructure where agents execute (especially critical for distributed multi-agent architectures with supervisor and sub-agents on different compute), and build the eval and observability pipelines that close the feedback loop between production and experimentation.
5. Assign Non-Technical Domain Experts their agent-specific role
These people have the most proximity to the problem. Give them meaningful control over prompt and context engineering, the primary lever for changing agent behaviour. Also deploy them in human annotation workflows: they should review agent traces and label whether the agent performed well or poorly, and critically, explain WHY. Do not treat this as optional or cosmetic.
6. Define the eval and observability pipeline jointly
Evals (pre-production experimentation) and observability (post-production monitoring) are the two pillars of agent quality. The team must agree on what 'good' looks like functionally, not just technically. Use production data to continuously expand the offline evaluation dataset. Check whether LLM-as-judge evals are converging toward or diverging from human agreement over time.
7. Pressure-test the team against the 'Proximity to the Problem' principle
Ask: does the team have at least one person who deeply understands what the end agent is actually meant to solve? If the team is entirely engineers with no domain expert involved, the agent will lack the contextual grounding needed to be relevant. Fix this before building, not after.
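
The coverage audit in the steps above can be sketched as a small check. This is an illustrative sketch only: the `Persona` enum, the function names, and the sample team members are hypothetical and are not part of the framework itself.

```python
from enum import Enum


class Persona(Enum):
    DATA_SCIENTIST = "Data Scientist / ML Engineer"
    PRODUCT_ENGINEER = "Product / Application / Systems Engineer"
    DOMAIN_EXPERT = "Non-Technical Domain Expert"


def missing_personas(team: dict[str, Persona]) -> list[Persona]:
    """Return the personas that no team member covers, in enum order."""
    covered = set(team.values())
    return [persona for persona in Persona if persona not in covered]


def is_ml_only(team: dict[str, Persona]) -> bool:
    """Flag the warning sign of a team staffed entirely by ML engineers."""
    return bool(team) and set(team.values()) == {Persona.DATA_SCIENTIST}


# Made-up team: an ML platform group handed the agent mandate by default.
team = {
    "Asha": Persona.DATA_SCIENTIST,
    "Ben": Persona.DATA_SCIENTIST,
}

gaps = missing_personas(team)
for persona in gaps:
    print(f"Missing: {persona.value}")
print(f"ML-only warning: {is_ml_only(team)}")
```

A real audit would also need to capture whether a persona is present but marginalised, which a simple membership check like this cannot express.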

// What does the Hetzel framework look like in real-world scenarios?

A large financial services firm assigns its existing ML platform team to build a customer-facing agent that answers account queries, because 'it has AI in the name'.

Classify as Traditional Enterprise. Audit the team — likely heavy on ML engineers, missing product engineers and domain experts (customer service specialists, compliance officers). Redirect ML engineers to guardrail and eval validation roles. Bring in product engineers to manage the LLM-as-API integration and systems infrastructure. Recruit customer service SMEs for prompt/context engineering and human annotation of agent traces. Redefine eval criteria beyond precision/recall to include functional performance — does the agent actually resolve the customer's query correctly and safely?

An AI-native startup building a legal research agent has a small team of generalist engineers who are moving fast but have no formal eval process.

Classify as AI Native. The proximity-to-the-problem advantage is present — engineers are close to the use case. The gap is rigour. Add a data scientist or someone with a stats background to build guardrails and design LLM-as-judge eval pipelines validated against labelled data. Formalise human annotation by involving a legal domain expert who reviews agent traces and labels correctness with reasoning. Build an observability pipeline so production behaviour feeds back into the offline eval dataset continuously.

// What mistakes should you avoid when staffing an agentic AI team?

Handing agentic development entirely to ML or data science teams because 'it has AI in the name' — this is the most common Traditional Enterprise mistake and leads to teams optimising for the wrong metrics.
Obsessing over precision, recall, and F1 as the primary eval signals for agents — these are technical metrics suited to a two-box ML pipeline, not to the broad functional surface area of agentic behaviour.
Ignoring non-technical domain experts or treating their input as cosmetic — these people hold the most proximity to the problem and are the primary contributors to prompt/context engineering and human annotation quality.
Treating fine-tuning as the default approach — fine-tuning open-source models is rare and should only be pursued when the use case genuinely demands it; most agent behaviour is changed via context engineering, not retraining.
Skipping the observability pipeline post-production — confidence in an agent built in experimentation does not transfer automatically to production; real usage confronts the agent with scenarios that evals did not anticipate.
Letting LLM-as-judge evals run unchecked without validating them against human-labelled ground truth — judges are just prompts and models; they can drift from human agreement without a self-check mechanism.
Treating agents as just another predictive model and applying the traditional cross-validation and A/B-testing routine: the pipeline is entirely different once the model is already built.
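
To make the judge self-check concrete, here is a minimal sketch that compares LLM-as-judge verdicts with human labels on the same traces. It assumes each verdict is a boolean where `True` means "the agent performed well", and it treats the human label as ground truth; the function name and sample data are illustrative.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Score judge verdicts against human labels (human label = ground truth)."""
    if not human or len(human) != len(judge):
        raise ValueError("Need two non-empty label lists of equal length.")

    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)
    fp = sum(1 for h, j in pairs if not h and j)
    fn = sum(1 for h, j in pairs if h and not j)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    agreement = sum(1 for h, j in pairs if h == j) / len(pairs)

    return {"precision": precision, "recall": recall, "f1": f1, "agreement": agreement}


# Made-up labels for six traces.
human_labels = [True, True, False, True, False, False]
judge_labels = [True, False, False, True, True, False]

scores = judge_agreement(human_labels, judge_labels)
print(scores)
```

Here precision, recall, and F1 measure the judge, not the agent, which is the role the framework gives these metrics. Re-running the same check on fresh human labels over time is one way to see whether the judge is drifting away from human agreement.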

// What are the key terms in the Hetzel Agent Team Composition Framework?

Agent Quality: The discipline of ensuring an agent performs correctly both before and after production, comprising two pillars: evals and observability.
Evals: Evaluations performed during experimentation and development to build confidence in an agent's execution before it is pushed to production.
Agent Observability: Monitoring an agent's behaviour after it is in production to maintain confidence in its execution as it encounters real users and real usage.
Proximity to the Problem: The degree to which a team member understands what the end agent is actually meant to solve. Higher proximity — typically found in domain experts and SMEs — leads to better context engineering and annotation quality.
Context Engineering: The primary lever for changing agent behaviour: adjusting the prompts, context, and inputs fed to a pre-built LLM rather than retraining or feature engineering.
LLM-as-Judge: Using a language model to evaluate the outputs of an agent as part of the eval process. Requires validation against labelled datasets to ensure the judge itself is trustworthy.
Human Annotation Workflow: A structured process in which domain experts review agent traces and label whether the agent performed well or poorly — and explain why — to generate grounded training and evaluation signal.
Agent Trace: A logged record of an agent's execution steps, decisions, and outputs that can be reviewed by technical or non-technical evaluators.
Traditional Enterprise: An organisation that approaches agentic development by delegating it to an existing ML or data science platform team, typically because generative AI is categorised as an 'AI problem'.
AI Native: An organisation that built its entire offering around agents from the start, typically characterised by small, cross-functional, agile teams with high proximity to the problem and no legacy ML platform.
Functional Performance: Evaluation of whether an agent actually accomplishes its intended purpose for real users — as distinct from technical performance metrics like precision, recall, and F1.
Distributed Agent / Supervisor + Sub-Agents: A multi-agent architecture in which a supervisor agent orchestrates multiple child or sub-agents running on different infrastructure, calling different systems — a complex systems engineering problem.
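
The trace and annotation terms above could be represented with two small record types. This is a hypothetical sketch: the field names are assumptions, and the only rule it encodes from the framework is that an annotation must explain why, not just give a label.

```python
from dataclasses import dataclass


@dataclass
class AgentTrace:
    """Logged record of one agent run (field names are illustrative)."""
    trace_id: str
    steps: list[str]
    final_output: str


@dataclass
class Annotation:
    """A domain expert's verdict on one trace, with required reasoning."""
    trace_id: str
    annotator: str
    performed_well: bool
    reason: str

    def __post_init__(self) -> None:
        # Reject labels that arrive without an explanation.
        if not self.reason.strip():
            raise ValueError("An annotation must explain why, not just give a label.")


trace = AgentTrace(
    trace_id="t-1",
    steps=["look up account", "draft reply"],
    final_output="Your transfer is pending.",
)

note = Annotation(
    trace_id=trace.trace_id,
    annotator="customer-service SME",
    performed_well=False,
    reason="Did not tell the customer how long a pending transfer takes.",
)
print(note)
```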

// FREQUENTLY ASKED QUESTIONS

What is the Hetzel Agent Team Composition Framework?

The Hetzel Agent Team Composition Framework is a diagnostic for designing the right cross-functional team to build production-ready agentic AI systems. It identifies three essential personas — data scientists/ML engineers, product/application/systems engineers, and non-technical domain experts — and assigns each a specific role in agent development, evaluation, and observability. It was developed from Phil Hetzel's analysis of how traditional enterprises and AI-native startups commonly misstaff agent projects.

What is context engineering in agentic AI?

Context engineering is the primary lever for changing agent behaviour: adjusting the prompts, context, and inputs fed to a pre-built LLM rather than retraining or feature engineering. In the Hetzel framework, context engineering is the responsibility of non-technical domain experts who have the highest proximity to the problem the agent solves. This shifts creative control toward people with deep subject-matter knowledge, not just technical depth.

How do I decide who should build our AI agent?

Start by classifying your organisation as a Traditional Enterprise or AI Native, then audit your current team for coverage across three personas: data scientists, product engineers, and domain experts. Each persona has a distinct role — data scientists handle eval validation and risk, product engineers manage LLM-as-API integration and infrastructure, and domain experts own context engineering and human annotation. If any persona is missing, fill that gap before building.

How do I staff a cross-functional team for agentic AI development?

Map every team member to one of three personas: data scientist/ML engineer, product/application/systems engineer, or non-technical domain expert. Assign data scientists to eval validation and guardrails, not model training. Assign product engineers to LLM integration, infrastructure, and observability pipelines. Give domain experts meaningful control over prompt engineering and human annotation workflows. If a persona is absent, recruit for it — a team missing domain experts will build contextually weak agents.

How does the Hetzel framework compare to just assigning agents to your ML team?

Assigning agents to your ML team alone is the most common staffing mistake in traditional enterprises. The Hetzel framework argues that unlike traditional ML, the model is already built — the team's job is to implement, evaluate, and contextualise it. ML engineers optimise for precision, recall, and F1, but agent quality requires functional performance evaluation across a much broader surface area. The framework adds product engineers and domain experts as co-equals, not optional extras.

When should I use the Hetzel Agent Team Composition Framework?

Use it whenever your organisation is deciding who should own agentic AI development. The most critical moments are: when an existing ML or data science team has been handed agent-building responsibility by default, when a new AI initiative is being staffed from scratch, or when an agent project is struggling and you suspect the team composition is the root cause. It applies to both traditional enterprises and AI-native startups.

What results can I expect from applying the Hetzel framework to my agent team?

You can expect three outcomes: better agent quality from domain experts contributing to context engineering and annotation, more robust evaluations from data scientists focused on guardrails rather than model training, and stronger production reliability from product engineers owning observability and infrastructure. Teams that apply this framework avoid the trap of optimising for technical metrics while missing functional performance — whether the agent actually solves the user's problem.

What's the difference between evals and observability for AI agents?

Evals are evaluations performed during experimentation and development to build confidence in an agent before production. Observability is monitoring the agent's behaviour after deployment to maintain confidence as it encounters real users. Together they form the two pillars of agent quality. The Hetzel framework requires the full cross-functional team to define both jointly, ensuring functional performance criteria are set collaboratively, not just by engineers.
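
One way to picture the link between the two pillars is a step that folds reviewed production cases back into the offline eval set. The sketch below assumes a hypothetical `EvalCase` record and deduplicates on the user input; a real pipeline would need its own storage and its own definition of "duplicate".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One offline eval example (illustrative fields)."""
    user_input: str
    expected_behaviour: str


def extend_eval_set(offline: list[EvalCase], from_production: list[EvalCase]) -> list[EvalCase]:
    """Return the offline set plus any production cases not already covered."""
    seen = {case.user_input for case in offline}
    merged = list(offline)
    for case in from_production:
        if case.user_input not in seen:
            merged.append(case)
            seen.add(case.user_input)
    return merged


offline_set = [
    EvalCase("What is my balance?", "States the balance for the verified account."),
]

# Made-up cases surfaced by observability and reviewed by a domain expert.
production_cases = [
    EvalCase("What is my balance?", "States the balance for the verified account."),
    EvalCase("Close my late husband's account", "Escalates to a human specialist."),
    EvalCase("Can I reverse a transfer?", "Explains the reversal policy."),
]

updated_set = extend_eval_set(offline_set, production_cases)
print(len(updated_set))
```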

Why are domain experts so important for building AI agents?

Domain experts hold the most proximity to the problem the agent is meant to solve. In the Hetzel framework, they are the primary contributors to context engineering — the main lever for changing agent behaviour — and to human annotation workflows where they review agent traces and label performance with reasoning. Without domain experts, agents lack the contextual grounding needed to be relevant. Their role is not cosmetic; it is structurally essential.

Should I fine-tune a model for my AI agent?

Fine-tuning should not be the default approach. Most agent behaviour is changed via context engineering — adjusting prompts and inputs — not retraining. The Hetzel framework treats fine-tuning of open-source models as a rare case pursued only when the use case genuinely demands it. Data scientists should lead fine-tuning if required, but the team should exhaust context engineering options first. Defaulting to fine-tuning wastes resources and introduces unnecessary complexity.
