Hetzel Agent Team Composition Framework

Last updated: 26 May 2026

Design the right cross-functional team structure for building production-grade agentic AI applications by correctly positioning data scientists, engineers, and domain experts.

// TL;DR

The Hetzel Agent Team Composition Framework is a structured approach for designing cross-functional teams that build production-grade agentic AI applications. It corrects the common mistake of handing agent development entirely to data scientists or ML engineers by mapping three essential roles — data scientists (guardrails and eval validation), product/systems engineers (infrastructure and orchestration), and domain experts (context engineering and human annotation). Use it when deciding who should own an agentic AI initiative, when diagnosing why an agent team is stuck in POC mode, or when restructuring a team that has been assembled by default rather than by design.


// When should you use the Hetzel Agent Team Composition Framework?

Use this skill when an organisation is deciding who should own, build, or govern an agentic AI initiative — especially when a traditional ML or data science team has been handed agent development by default. Also use it when diagnosing why an existing agent team is struggling to reach production quality.

// What inputs do you need before applying the Hetzel Agent Team Composition Framework?

Organisation type (required)
Is this a Traditional Enterprise or an AI Native company? Determines baseline assumptions about team structure and existing tooling.
Agent use case description (required)
What problem is the agent meant to solve? Who are the end users? What does success look like functionally?
Current team composition (required)
Who is currently assigned to or owns the agentic development effort? Include roles and backgrounds.
Production vs. POC status
Is the team trying to get to production, or are they still in proof-of-concept mode?
Fine-tuning requirement
Does the use case require fine-tuning an open-source model, or is it purely prompt/context driven?
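The inputs above can be captured as a small record before any analysis starts. The following is a minimal sketch, not part of the framework itself: the class and field names are hypothetical, and the last two inputs are treated as optional only because the list above does not mark them as required.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OrgType(Enum):
    TRADITIONAL_ENTERPRISE = "Traditional Enterprise"
    AI_NATIVE = "AI Native"


@dataclass
class FrameworkInputs:
    # Required inputs (hypothetical field names)
    org_type: OrgType
    use_case: str
    current_team: list[str]
    # Inputs not marked as required; None means "not yet known"
    targeting_production: Optional[bool] = None
    needs_fine_tuning: Optional[bool] = None

    def missing_optional(self) -> list[str]:
        """Names of the optional inputs that have not been supplied yet."""
        return [
            name
            for name in ("targeting_production", "needs_fine_tuning")
            if getattr(self, name) is None
        ]


inputs = FrameworkInputs(
    org_type=OrgType.TRADITIONAL_ENTERPRISE,
    use_case="Customer-facing agent for loan document processing",
    current_team=["ML engineer", "ML engineer", "data scientist"],
)
print(inputs.missing_optional())
```

Keeping the unknowns explicit makes it harder to skip the production-status and fine-tuning questions later in the process.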

// What are the core principles behind the Hetzel Agent Team Composition Framework?

The Model Is Already Built

Unlike traditional ML, the foundational LLM has already been trained and deployed via an endpoint by Anthropic, OpenAI, Mistral, etc. The entire upstream data pipeline — ingestion, training, cross-validation, deployment — has already been done. Teams must stop recreating that workflow and focus instead on what comes after the API.

Proximity to the Problem

The person or role with the closest proximity to the problem the agent is trying to solve holds disproportionate value on an agent team. This is often NOT the ML engineer — it is the subject matter expert, product manager, or domain specialist who understands real user behaviour and intent.

Context Engineering, Not Feature Engineering

In traditional ML, behaviour is changed via feature engineering and retraining. In agentic AI, behaviour is changed by changing the inputs: the prompts, context, and instructions seeded into the model. This shift means the highest-leverage skill is context engineering, which can be performed by non-technical experts.
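As an illustration of changing behaviour by changing inputs, the sketch below assembles a prompt from instructions and context that a domain expert could edit as plain text. The function name, the prompt layout, and the loan-related strings are all assumptions made for the example; no model is called.

```python
def build_prompt(instructions: str, context: list[str], user_message: str) -> str:
    """Assemble the text sent to an already-trained model (layout is illustrative)."""
    context_block = "\n".join(f"- {item}" for item in context)
    return (
        f"Instructions:\n{instructions}\n\n"
        f"Context:\n{context_block}\n\n"
        f"User:\n{user_message}"
    )


question = "Is this application complete?"

# First attempt, written without domain input.
base = build_prompt(
    "Summarise the loan document.",
    ["Applicant income is stated on page 2."],
    question,
)

# Revision by a domain expert: no retraining, only the inputs change.
revised = build_prompt(
    "Summarise the loan document and list any missing signatures.",
    ["Applicant income is stated on page 2.", "Co-signers must sign page 5."],
    question,
)

print(revised)
```

The only difference between the two versions is text that a non-technical expert can own, which is the point of the principle.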

Broader Eval Surface Area

Agents require evaluating functional performance across the full agent trace, not just technical performance on a narrow two-class prediction problem. Fixating on precision, recall, and F1 is a category error when applied to agents — those metrics apply to the evaluators themselves, not to the agent's end-to-end behaviour.
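One way to picture evaluation across the full trace is a set of named functional checks that each inspect every step the agent took. This is a sketch under assumptions: the `TraceStep` shape, the step kinds, and the tool names are hypothetical and do not come from any particular tracing tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TraceStep:
    kind: str    # hypothetical labels: "tool_call" or "final_answer"
    name: str
    output: str


Trace = list[TraceStep]
Check = Callable[[Trace], bool]


def called_tool(tool_name: str) -> Check:
    """Check that the agent invoked a given tool somewhere in the trace."""
    return lambda trace: any(
        step.kind == "tool_call" and step.name == tool_name for step in trace
    )


def final_answer_mentions(text: str) -> Check:
    """Check that the last final answer contains the expected phrase."""
    def check(trace: Trace) -> bool:
        finals = [step for step in trace if step.kind == "final_answer"]
        return bool(finals) and text.lower() in finals[-1].output.lower()
    return check


def evaluate_trace(trace: Trace, checks: dict[str, Check]) -> dict[str, bool]:
    """Run every functional check against one full agent trace."""
    return {label: check(trace) for label, check in checks.items()}


trace = [
    TraceStep("tool_call", "fetch_document", "loan_application.pdf"),
    TraceStep("final_answer", "respond", "The application is missing a signature on page 5."),
]
results = evaluate_trace(trace, {
    "fetched the document": called_tool("fetch_document"),
    "checked credit history": called_tool("credit_lookup"),
    "flagged the missing signature": final_answer_mentions("missing a signature"),
})
print(results)
```

Each check describes something the agent must do for a user, so the pass/fail labels stay readable by domain experts rather than reducing the trace to a single classification score.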

LLM as Judge Needs Guardrails

LLM-as-judge is a powerful eval mechanism but is itself just a prompt and a model. Data scientists add unique value by creating labelled datasets and applying traditional recall/precision/F1 metrics to validate whether the LLM judge is actually agreeing with human judgement — preventing eval drift.
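The validation described above can be sketched by treating the human labels as ground truth and the judge's verdicts as predictions. The function name and the sample labels below are made up for illustration; a real labelled dataset would be far larger.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Score an LLM judge against human labels (True means 'output is acceptable')."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)          # both say acceptable
    fp = sum(1 for h, j in pairs if not h and j)      # judge too lenient
    fn = sum(1 for h, j in pairs if h and not j)      # judge too strict
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only: six traces reviewed by a human and by the judge.
human = [True, True, True, False, False, True]
judge = [True, True, False, True, False, True]
metrics = judge_agreement(human, judge)
print(metrics)
```

Here precision, recall, and F1 describe the judge, not the agent, which matches the distinction drawn in the previous principle.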

Diverse Team, Not Reassigned Team

The answer is neither to hand agents entirely to data scientists nor to exclude them entirely. The mistake traditional enterprises make is isolating agent development within the ML/data science team because generative AI has 'AI' in the name. The correct move is to build a deliberately diverse team spanning technical and non-technical roles.

The Evals + Observability Feedback Loop

Agent quality requires two pillars: Evals (confidence-building during experimentation before production) and Observability (maintaining confidence once the agent faces real users in production). A team without both pillars will produce agents that pass internal tests but degrade in the wild.
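The link between the two pillars can be sketched as an offline eval dataset that absorbs production traces and waits for human labels. Everything here is an assumption for illustration: the class names, the fields, and the in-memory storage stand in for whatever tracing and eval tooling a team actually uses.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalCase:
    trace_id: str
    user_input: str
    agent_output: str
    human_label: Optional[bool] = None  # None until a domain expert annotates it


@dataclass
class EvalDataset:
    cases: dict[str, EvalCase] = field(default_factory=dict)

    def harvest(self, production_traces: list[EvalCase]) -> int:
        """Add production traces not seen before; return how many were added."""
        added = 0
        for case in production_traces:
            if case.trace_id not in self.cases:
                self.cases[case.trace_id] = case
                added += 1
        return added

    def annotate(self, trace_id: str, label: bool) -> None:
        """Record a domain expert's verdict on one trace."""
        self.cases[trace_id].human_label = label

    def awaiting_annotation(self) -> list[str]:
        """Trace ids that still need a human label."""
        return [tid for tid, case in self.cases.items() if case.human_label is None]


dataset = EvalDataset()
dataset.harvest([
    EvalCase("t1", "Book Dr Lee at 9am", "Booked for 9am."),
    EvalCase("t2", "Cancel my visit", "Rescheduled to Friday."),
])
# A later harvest overlaps with the first; the duplicate is skipped.
added = dataset.harvest([
    EvalCase("t2", "Cancel my visit", "Rescheduled to Friday."),
    EvalCase("t3", "Move my visit to Monday", "Moved to Monday."),
])
dataset.annotate("t2", False)  # the agent rescheduled instead of cancelling
print(added, dataset.awaiting_annotation())
```

Observability supplies the traces, annotation supplies the ground truth, and the resulting dataset is what the next round of evals runs against.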

// How do you apply the Hetzel Agent Team Composition Framework step by step?

1
Classify the organisation as Traditional Enterprise or AI Native
Traditional Enterprise: existing ML/data science platform team likely owns or has been delegated agent work top-down. AI Native: small cross-functional engineering team, no legacy AI ownership, everyone is closer to the product. This classification sets the default risk — Traditional Enterprise teams are most likely to make the isolation mistake.
2
Audit the current team composition against the three required roles
Check whether the team has all three role types: (1) Data Scientists / ML Engineers, (2) Product, Application, and Systems Engineers, (3) Non-Technical Domain Experts / Subject Matter Experts / Product Managers. Flag any role type that is absent or underrepresented. A team that is 100% data scientists or 100% engineers is a red flag.
3
Map the use case requirements to the role that owns each responsibility
Use this mapping: Data Scientists own — guardrails/risk assessment, LLM-as-judge validation, labelled dataset creation, fine-tuning (if required). Product/Systems Engineers own — API integration, distributed systems architecture, infra for sub-agent orchestration, eval and observability pipeline implementation. Domain Experts/PMs own — prompt and context engineering, human annotation, defining what 'good' agent behaviour looks like and why.
4
Determine whether fine-tuning is actually required
Fine-tuning an open-source model is rare and represents the highest-leverage technical contribution for ML engineers. Before assigning significant data science resource, confirm the use case cannot be solved via context engineering alone. Most use cases can. Reserve fine-tuning work as a deliberate, scoped assignment — not a default.
5
Assign data scientists specifically to guardrail and eval validation roles
Data scientists should NOT be asked to simply 'own agents' generically. Their unique value is: (a) being the adult in the room on LLM risk — reminding the team that the LLM is just predicting token after token, not 'knowing' anything; (b) validating LLM-as-judge eval quality using labelled datasets and precision/recall/F1; (c) preventing the team from over-trusting outputs. If they are asked to do systems engineering or prompt engineering instead, that is a misallocation.
6
Explicitly bring domain experts into the prompt and context engineering workflow
Non-technical subject matter experts and PMs must have direct control over or significant input into the prompts and context seeded into the agent. They should also be the primary human annotators reviewing agent traces — because they are the ones who can judge whether the agent is performing well and, critically, WHY it is or is not performing well. Do not gate this behind a technical intermediary.
7
Assign systems/product engineers to the distributed infrastructure and eval pipeline
Agentic architectures often involve a supervisor agent calling child or sub-agents running on different infrastructure and calling different downstream systems. This is a complex systems problem, not a statistics problem. Product and application engineers should own this layer. They should also implement the evals and observability pipeline that closes the feedback loop between production data and offline experimentation datasets.
8
Design the Evals + Observability feedback loop
Evals belong to experimentation: they build confidence before pushing to production. Observability belongs to production: it maintains confidence as real users interact with the agent. The loop closes when production data is continuously harvested to update the offline eval dataset. Check: is ground-truth (human-labelled) data being added over time, so that the LLM judge's agreement with human judgement can be tracked and corrected when it drifts?
9
Re-evaluate team composition against the agent's functional performance criteria
Do not use traditional ML metrics (precision, recall, F1) as the primary success criteria for the agent itself. Define functional performance criteria — what does the agent need to DO correctly, end-to-end, for a real user? Assign the domain experts to define these criteria. Then let data scientists validate the eval mechanism used to measure them.
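Steps 2 and 3 above (auditing the team against the three role types and mapping responsibilities to owners) can be sketched as a simple lookup. The role keys, the function name, and the team members are hypothetical; the responsibilities are paraphrased from the mapping in step 3.

```python
# Responsibilities per role type, paraphrased from step 3.
REQUIRED_ROLES: dict[str, list[str]] = {
    "data_science": [
        "guardrails and risk assessment",
        "LLM-as-judge validation",
        "labelled dataset creation",
        "fine-tuning (if required)",
    ],
    "engineering": [
        "API integration",
        "distributed systems architecture",
        "sub-agent orchestration infrastructure",
        "eval and observability pipeline",
    ],
    "domain_expert": [
        "prompt and context engineering",
        "human annotation",
        "defining good agent behaviour",
    ],
}


def audit_team(team: dict[str, str]) -> dict[str, list[str]]:
    """Map each absent role type to the responsibilities nobody currently owns.

    `team` maps a member's name to one of the keys in REQUIRED_ROLES.
    """
    present = set(team.values())
    unknown = present - set(REQUIRED_ROLES)
    if unknown:
        raise ValueError(f"unknown role types: {sorted(unknown)}")
    return {
        role: duties for role, duties in REQUIRED_ROLES.items() if role not in present
    }


# A team made up only of ML staff (hypothetical names).
ml_platform_team = {"Asha": "data_science", "Ben": "data_science", "Chen": "data_science"}
gaps = audit_team(ml_platform_team)
for role, duties in gaps.items():
    print(f"Missing {role}: nobody owns {', '.join(duties)}")
```

This only detects absent role types. Judging whether a role is underrepresented, as step 2 also asks, still needs a human look at workload and seniority.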

// What does the Hetzel Agent Team Composition Framework look like in practice?

A large financial services firm has its ML platform team assigned to build a customer-facing AI agent for loan document processing. The team has strong Python and model training skills but limited exposure to LLM APIs or customer workflows.

Classify as Traditional Enterprise — isolation mistake is likely in progress. Audit reveals: strong ML coverage, no product/systems engineers on the team, no loan officers or underwriters (domain experts) involved. Assign ML engineers to guardrails and LLM-as-judge validation. Bring in a systems engineer to handle the multi-system tool-calling architecture. Pull loan officers into prompt and context engineering sessions — they have the closest proximity to the problem. Resist the urge to fine-tune; attempt context engineering first. Implement observability from day one so production traces can be human-annotated by underwriters.

An AI-native startup of five engineers is building an autonomous scheduling agent for healthcare clinics. Everyone codes; no one has a formal data science background.

Classify as AI Native — cross-functional proximity is a strength. Gap identified: no guardrails role — no one is stress-testing the LLM's statistical limitations or validating the LLM-as-judge eval quality. Recommend bringing in a fractional data scientist or assigning one engineer to own the guardrails and eval validation function. Ensure clinic staff (domain experts) are directly editing and owning the context/prompts seeded into the scheduling agent — they know edge cases engineers will never anticipate. Build a labelled dataset from the first 200 production interactions so eval quality can be measured against human agreement.

// What mistakes should you avoid when structuring an agentic AI team?

The Isolation Mistake: handing agent development entirely to the ML/data science team because generative AI has 'AI' in the name — this misses the systems engineering complexity and the proximity-to-problem advantage of domain experts.
Applying traditional ML metrics (precision, recall, F1) as the primary success criteria for the agent itself. These metrics apply to evaluators and classifiers, not to the functional performance of a full agent trace.
Over-trusting LLM-as-judge outputs during evals without creating a labelled dataset to validate whether the judge actually agrees with human judgement.
Treating fine-tuning as the default technical contribution of data scientists on agent teams — fine-tuning is rare; the more frequent and impactful contribution is guardrails and eval quality control.
Gating prompt and context engineering behind technical staff, when domain experts and product managers have the highest proximity to the problem and should directly own or co-own this layer.
Building strong evals but no observability — confidence built in experimentation degrades rapidly once the agent meets real users without a production monitoring loop.
Confusing proof-of-concept velocity with production readiness — many teams are prolific at building generative AI POCs but fail to implement the eval and observability pipelines needed to bring those POCs to production safely.

// What key terms should you know for the Hetzel Agent Team Composition Framework?

Agent Quality Platform: A platform category (exemplified by BrainTrust) focused on maintaining confidence in agent execution through two pillars: Evals and Observability.
Evals: Evaluation processes performed during experimentation — before production — used to build confidence in how an agent will execute once deployed.
Agent Observability: Monitoring and analysis performed after an agent is in production to maintain confidence in its execution as it encounters real usage and real users.
The Model Is Already Built: Core principle that the foundational LLM training pipeline has already been executed by model providers (Anthropic, OpenAI, etc.), meaning agent teams must focus downstream of the model, not on replicating the training workflow.
Context Engineering: The practice of modifying agent behaviour by changing the prompts, context, and instructions fed to an already-trained model — the generative AI analogue to feature engineering in traditional ML.
Proximity to the Problem: The degree to which a team member understands what the end agent is actually meant to solve and how real users will interact with it. Higher proximity = higher leverage for prompt/context engineering and human annotation.
LLM as Judge: An eval technique where a language model is used to assess the quality of another model's outputs. Powerful but itself just a prompt and a model — must be validated against labelled human-agreement data.
Human Annotation Workflow: A structured process where domain experts review agent traces and label whether the agent is performing correctly and why — critical input for grounding evals in real-world quality standards.
Traditional Enterprise: An organisation with pre-existing ML/data science platform teams that typically inherits agent development top-down, often making the isolation mistake of keeping it within that existing team.
AI Native: An organisation that built its offering around agents from the start, with small cross-functional engineering teams, no legacy AI ownership structures, and high per-person proximity to the problem.
The Isolation Mistake: The error of assigning agent development exclusively to ML/data science teams because generative AI has 'AI' in the name, ignoring the systems engineering and domain expertise requirements of production agents.
Guardrails Role: The function — best filled by data scientists — of providing statistical rigour and risk awareness to agent teams, reminding them of LLM limitations and preventing over-trust in model outputs.
Broader Eval Surface Area: The recognition that agents must be evaluated on functional performance across the full agent trace, not just narrow technical metrics on a classification problem — a substantially wider scope than traditional ML evaluation.

// FREQUENTLY ASKED QUESTIONS

What is the Hetzel Agent Team Composition Framework?

The Hetzel Agent Team Composition Framework is a method for designing cross-functional teams that build production-grade AI agents. It defines three essential roles (data scientists for guardrails and eval validation, product/systems engineers for infrastructure and orchestration, and domain experts for context engineering and human annotation) and maps each to specific responsibilities. It corrects the common enterprise mistake of isolating agent work within ML teams by emphasising proximity to the problem and context engineering over traditional model training workflows.

What is the Isolation Mistake in AI agent team building?

The Isolation Mistake is the error of assigning agent development exclusively to ML or data science teams because generative AI has 'AI' in the name. This ignores two critical requirements: the systems engineering complexity of multi-agent orchestration and the proximity-to-problem advantage that domain experts bring to prompt and context engineering. The Hetzel framework identifies this as the single most common structural failure in traditional enterprises building agents.

How do I decide who should own an agentic AI project in my company?

Start by classifying your organisation as Traditional Enterprise or AI Native, then audit your current team against three required roles: data scientists, product/systems engineers, and domain experts. The Hetzel framework assigns ownership based on proximity to the problem: domain experts and PMs typically hold disproportionate value because they understand real user behaviour. Data scientists own guardrails and eval validation, while engineers own infrastructure and the eval-observability pipeline. No single role should own the entire initiative.

How do you apply the Hetzel Agent Team Composition Framework step by step?

Follow nine steps: classify your org type, audit current team composition against three required roles, map responsibilities to roles, determine if fine-tuning is actually needed, assign data scientists to guardrails and eval validation, bring domain experts into prompt and context engineering, assign engineers to distributed infrastructure, design the evals-plus-observability feedback loop, and re-evaluate team composition against functional performance criteria. Each step has a specific deliverable and a named role owner.

How does the Hetzel framework compare to just having data scientists build AI agents?

Having only data scientists build agents misses two critical dimensions. First, agentic architectures involve supervisor agents calling sub-agents across distributed systems, which is a systems engineering problem, not a statistics problem. Second, agent behaviour is changed through context engineering (prompts and instructions), not feature engineering or retraining, and domain experts with proximity to the problem are better positioned for that work. The Hetzel framework deliberately combines all three roles rather than defaulting to any single one.
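To make the systems side concrete, here is a deliberately small sketch of a supervisor dispatching work to sub-agents and recording a trace. It rests on loud assumptions: the sub-agents are local functions standing in for services on separate infrastructure, and keyword matching stands in for whatever routing decision a real supervisor would make. All names are hypothetical.

```python
from typing import Callable

SubAgent = Callable[[str], str]


def document_agent(task: str) -> str:
    # Stand-in for a remote sub-agent that reads documents.
    return f"[documents] handled: {task}"


def scheduling_agent(task: str) -> str:
    # Stand-in for a remote sub-agent that manages calendars.
    return f"[scheduling] handled: {task}"


class Supervisor:
    def __init__(self, sub_agents: dict[str, SubAgent]) -> None:
        self.sub_agents = sub_agents
        self.trace: list[tuple[str, str]] = []  # (route taken, result or task)

    def route(self, task: str) -> str:
        """Send the task to the first sub-agent whose keyword appears in it."""
        for keyword, agent in self.sub_agents.items():
            if keyword in task.lower():
                result = agent(task)
                self.trace.append((keyword, result))
                return result
        self.trace.append(("unrouted", task))
        return "No sub-agent available for this task."


supervisor = Supervisor({"document": document_agent, "schedule": scheduling_agent})
print(supervisor.route("Schedule a follow-up"))
print(supervisor.route("Translate this letter"))
print(supervisor.trace)
```

Even in this toy form, the hard questions are about routing, failure handling, and tracing across components, which is why the framework places this layer with engineers.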

When should I use the Hetzel Agent Team Composition Framework?

Use it when your organisation is deciding who should own, build, or govern an agentic AI initiative, especially when a traditional ML or data science team has been assigned agent work by default. Also use it when diagnosing why an existing agent team is struggling to move from proof-of-concept to production. It is particularly valuable for traditional enterprises where legacy AI team structures create a high risk of the Isolation Mistake.

What results can I expect from using the Hetzel Agent Team Composition Framework?

The framework is intended to produce faster progression from POC to production, higher-quality agent behaviour grounded in real domain expertise, and more reliable eval pipelines. Teams restructured this way are better placed to avoid the common trap of building impressive demos that fail in production. The dual evals-plus-observability feedback loop is designed to maintain, once the agent faces real users, the confidence built during experimentation, and labelled-dataset validation helps catch eval drift early.

What is context engineering in agentic AI and why does it matter for team composition?

Context engineering is the practice of modifying agent behaviour by changing the prompts, context, and instructions fed to an already-trained model; it is the generative AI analogue of feature engineering in traditional ML. It matters for team composition because it can be performed by non-technical domain experts and product managers who have the highest proximity to the problem. Gating this work behind technical staff is a misallocation identified by the Hetzel framework as a key pitfall.

What role do data scientists play on an agentic AI team?

Data scientists play a specific, high-leverage role: guardrails and eval validation. They remind the team that the LLM is predicting tokens, not 'knowing' anything. They create labelled datasets and apply precision, recall, and F1 metrics to validate whether LLM-as-judge evaluators actually align with human judgement. They also own fine-tuning when it is genuinely required. They should not be asked to own agents generically or to perform systems engineering or prompt engineering; those are misallocations.

Do I need to fine-tune a model to build an AI agent?

No — most agentic AI use cases can be solved through context engineering alone, without fine-tuning. The Hetzel framework recommends confirming that context engineering cannot solve the problem before assigning significant data science resource to fine-tuning. Fine-tuning an open-source model is rare and represents the highest-leverage technical contribution for ML engineers when it is needed, but it should be a deliberate, scoped assignment rather than a default assumption.

What is the difference between evals and observability for AI agents?

Evals are evaluation processes performed during experimentation — before production — used to build confidence in how an agent will perform once deployed. Observability is monitoring and analysis performed after an agent is in production to maintain confidence as it encounters real users. The Hetzel framework treats these as two pillars that form a feedback loop: production data from observability is harvested to update the offline eval dataset, ensuring eval quality tracks real-world conditions over time.
