Hetzel Agent Team Composition Framework

Design the right cross-functional team structure for building production-grade agentic AI applications by correctly positioning data scientists, engineers, and domain experts.

// TL;DR

The Hetzel Agent Team Composition Framework is a structured approach for designing cross-functional teams that build production-grade agentic AI applications. It prevents the common 'isolation mistake' of handing agent development entirely to data scientists just because it involves AI. Instead, it maps three essential roles — data scientists (guardrails/eval validation), systems engineers (infrastructure/orchestration), and domain experts (context engineering/annotation) — to specific responsibilities. Use it when your organization is deciding who should own, build, or govern an agentic AI initiative, or when diagnosing why an existing agent team is stuck before production.

// When should you use the Hetzel Agent Team Composition Framework?

Use this skill when an organisation is deciding who should own, build, or govern an agentic AI initiative — especially when a traditional ML or data science team has been handed agent development by default. Also use it when diagnosing why an existing agent team is struggling to reach production quality.

// What information do you need before applying the Hetzel framework?

  • Organisation type (required)
    Is this a Traditional Enterprise or an AI Native company? Determines baseline assumptions about team structure and existing tooling.
  • Agent use case description (required)
    What problem is the agent meant to solve? Who are the end users? What does success look like functionally?
  • Current team composition (required)
    Who is currently assigned to or owns the agentic development effort? Include roles and backgrounds.
  • Production vs. POC status
    Is the team trying to get to production, or are they still in proof-of-concept mode?
  • Fine-tuning requirement
    Does the use case require fine-tuning an open-source model, or is it purely prompt/context driven?

// What are the core principles behind the Hetzel Agent Team Composition Framework?

The Model Is Already Built

Unlike traditional ML, the foundational LLM has already been trained and deployed via an endpoint by Anthropic, OpenAI, Mistral, etc. The entire upstream data pipeline — ingestion, training, cross-validation, deployment — has already been done. Teams must stop recreating that workflow and focus instead on what comes after the API.

Proximity to the Problem

The person or role with the closest proximity to the problem the agent is trying to solve holds disproportionate value on an agent team. This is often NOT the ML engineer — it is the subject matter expert, product manager, or domain specialist who understands real user behaviour and intent.

Context Engineering, Not Feature Engineering

In traditional ML, behaviour is changed via feature engineering and retraining. In agentic AI, behaviour is changed by changing the inputs: the prompts, context, and instructions seeded into the model. This shift means the highest-leverage skill is context engineering, which can be performed by non-technical experts.

Broader Eval Surface Area

Agents require evaluating functional performance across the full agent trace, not just technical performance on a narrow two-class prediction problem. Fixating on precision, recall, and F1 is a category error when applied to agents — those metrics apply to the evaluators themselves, not to the agent's end-to-end behaviour.
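As a rough illustration of evaluating across the full trace, the sketch below runs named functional checks over every step of a recorded agent run rather than scoring a single prediction. The `Trace` and `Step` shapes, the tool names, and the checks are all hypothetical and chosen only for this example; a real agent platform will record traces in its own format.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    kind: str    # hypothetical step types, e.g. "tool_call" or "llm_response"
    name: str
    output: str


@dataclass
class Trace:
    user_request: str
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""


# A functional check is a predicate over the whole trace, not one prediction.
Check = Callable[[Trace], bool]


def called_tool(tool: str) -> Check:
    """Passes if the agent invoked the named tool at any point in the trace."""
    return lambda t: any(s.kind == "tool_call" and s.name == tool for s in t.steps)


def answer_mentions(text: str) -> Check:
    """Passes if the final answer contains the given text (case-insensitive)."""
    return lambda t: text.lower() in t.final_answer.lower()


def evaluate_trace(trace: Trace, checks: dict[str, Check]) -> dict[str, bool]:
    """Run every named check against the trace and report pass/fail per check."""
    return {name: check(trace) for name, check in checks.items()}


trace = Trace(
    user_request="Is the income on this loan application verified?",
    steps=[
        Step("tool_call", "fetch_document", "payslip.pdf"),
        Step("llm_response", "summarise", "Income matches payslip."),
    ],
    final_answer="Yes, the stated income is verified against the payslip.",
)

# Illustrative criteria; in practice domain experts would define these.
checks = {
    "fetched_source_document": called_tool("fetch_document"),
    "stated_verification_outcome": answer_mentions("verified"),
    "ran_fraud_screen": called_tool("fraud_screen"),
}

results = evaluate_trace(trace, checks)
print(results)
```

Here the trace passes the first two checks and fails the third, which is the kind of end-to-end finding a single classification metric would not surface.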

LLM as Judge Needs Guardrails

LLM-as-judge is a powerful eval mechanism but is itself just a prompt and a model. Data scientists add unique value by creating labelled datasets and applying traditional recall/precision/F1 metrics to validate whether the LLM judge is actually agreeing with human judgement — preventing eval drift.
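A minimal sketch of that validation step, assuming each trace has a binary pass/fail verdict from both a human annotator and the LLM judge. The function name and the sample labels are made up for illustration; the human labels are treated as ground truth and "pass" as the positive class.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Score LLM-judge verdicts against human labels (human = ground truth)."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")

    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)        # both say pass
    fp = sum(1 for h, j in pairs if not h and j)    # judge passes, human fails
    fn = sum(1 for h, j in pairs if h and not j)    # judge fails, human passes

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical verdicts on eight annotated traces.
human = [True, True, False, True, False, False, True, False]
judge = [True, False, False, True, True, False, True, False]

metrics = judge_agreement(human, judge)
print(metrics)  # precision, recall and f1 are each 0.75 for this sample
```

Re-running the same calculation as new human labels arrive gives a simple way to notice when the judge starts to diverge from human judgement.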

Diverse Team, Not Reassigned Team

The answer is neither to hand agents entirely to data scientists nor to exclude them entirely. The mistake traditional enterprises make is isolating agent development to the ML/data science team because generative AI has 'AI' in the name. The correct move is to build a deliberately diverse team spanning technical and non-technical roles.

The Evals + Observability Feedback Loop

Agent quality requires two pillars: Evals (confidence-building during experimentation before production) and Observability (maintaining confidence once the agent faces real users in production). A team without both pillars will produce agents that pass internal tests but degrade in the wild.
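One way to picture how the two pillars connect is sketched below: production traces that a human has annotated are harvested into the offline eval dataset, and the judge's agreement with those human labels is tracked over time. The `ProductionTrace` record, the field names, and the in-memory dictionary are assumptions made for this sketch, not part of any specific platform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProductionTrace:
    trace_id: str
    agent_output: str
    judge_pass: bool                   # verdict from the LLM judge
    human_pass: Optional[bool] = None  # filled in later by a human annotator


def harvest(eval_dataset: dict[str, ProductionTrace],
            traces: list[ProductionTrace]) -> int:
    """Add human-annotated production traces to the offline eval dataset.

    Traces without a human label are skipped. Returns the number added.
    """
    added = 0
    for t in traces:
        if t.human_pass is not None and t.trace_id not in eval_dataset:
            eval_dataset[t.trace_id] = t
            added += 1
    return added


def judge_human_agreement(eval_dataset: dict[str, ProductionTrace]) -> float:
    """Fraction of labelled traces where judge and human verdicts match."""
    rows = list(eval_dataset.values())
    if not rows:
        return 0.0
    return sum(r.judge_pass == r.human_pass for r in rows) / len(rows)


# Hypothetical batch of traces observed in production.
batch = [
    ProductionTrace("t1", "Booked 09:00 slot", judge_pass=True, human_pass=True),
    ProductionTrace("t2", "Double-booked room", judge_pass=True, human_pass=False),
    ProductionTrace("t3", "Rescheduled visit", judge_pass=False),  # not yet annotated
    ProductionTrace("t4", "Ignored closure day", judge_pass=False, human_pass=False),
]

eval_dataset: dict[str, ProductionTrace] = {}
added = harvest(eval_dataset, batch)
agreement = judge_human_agreement(eval_dataset)
print(added, round(agreement, 2))
```

A falling agreement figure across successive batches would be a signal to revisit the judge prompt before trusting further eval results.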

// How do you apply the Hetzel framework step by step?

  1. Classify the organisation as Traditional Enterprise or AI Native

    Traditional Enterprise: existing ML/data science platform team likely owns or has been delegated agent work top-down. AI Native: small cross-functional engineering team, no legacy AI ownership, everyone is closer to the product. This classification sets the default risk — Traditional Enterprise teams are most likely to make the isolation mistake.

  2. Audit the current team composition against the three required roles

    Check whether the team has all three role types: (1) Data Scientists / ML Engineers, (2) Product, Application, and Systems Engineers, (3) Non-Technical Domain Experts / Subject Matter Experts / Product Managers. Flag any role type that is absent or underrepresented. A team that is 100% data scientists or 100% engineers is a red flag.

  3. Map the use case requirements to the role that owns each responsibility

    Use this mapping: Data Scientists own — guardrails/risk assessment, LLM-as-judge validation, labelled dataset creation, fine-tuning (if required). Product/Systems Engineers own — API integration, distributed systems architecture, infra for sub-agent orchestration, eval and observability pipeline implementation. Domain Experts/PMs own — prompt and context engineering, human annotation, defining what 'good' agent behaviour looks like and why.

  4. Determine whether fine-tuning is actually required

    Fine-tuning an open-source model is rarely required; when it is, it is the highest-leverage technical contribution ML engineers can make. Before assigning significant data science resource, confirm the use case cannot be solved via context engineering alone; most use cases can. Reserve fine-tuning as a deliberate, scoped assignment, not a default.

  5. Assign data scientists specifically to guardrail and eval validation roles

    Data scientists should NOT be asked to simply 'own agents' generically. Their unique value is: (a) being the adult in the room on LLM risk — reminding the team that the LLM is just predicting token after token, not 'knowing' anything; (b) validating LLM-as-judge eval quality using labelled datasets and precision/recall/F1; (c) preventing the team from over-trusting outputs. If they are asked to do systems engineering or prompt engineering instead, that is a misallocation.

  6. Explicitly bring domain experts into the prompt and context engineering workflow

    Non-technical subject matter experts and PMs must have direct control over or significant input into the prompts and context seeded into the agent. They should also be the primary human annotators reviewing agent traces — because they are the ones who can judge whether the agent is performing well and, critically, WHY it is or is not performing well. Do not gate this behind a technical intermediary.

  7. Assign systems/product engineers to the distributed infrastructure and eval pipeline

    Agentic architectures often involve a supervisor agent calling child or sub-agents running on different infrastructure and calling different downstream systems. This is a complex systems problem, not a statistics problem. Product and application engineers should own this layer. They should also implement the evals and observability pipeline that closes the feedback loop between production data and offline experimentation datasets.

  8. Design the Evals + Observability feedback loop

    Evals belong to experimentation: they build confidence before pushing to production. Observability belongs to production: it maintains confidence as real users interact with the agent. The loop closes when production data is continuously harvested to update the offline eval dataset. Check: is ground-truth (human-labelled) data being added over time, so that the LLM judge's agreement with human annotators can be tracked and corrected?

  9. Re-evaluate team composition against the agent's functional performance criteria

    Do not use traditional ML metrics (precision, recall, F1) as the primary success criteria for the agent itself. Define functional performance criteria — what does the agent need to DO correctly, end-to-end, for a real user? Assign the domain experts to define these criteria. Then let data scientists validate the eval mechanism used to measure them.
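The audit and role-mapping steps above can be sketched as a small check. The responsibility lists come from the mapping in the steps; the role keys (`data_scientist`, `engineer`, `domain_expert`), the function name, and the sample team are hypothetical labels introduced only for this example.

```python
from collections import Counter

# Responsibilities per role type, taken from the mapping step.
# The dictionary keys are illustrative labels, not official role names.
RESPONSIBILITIES = {
    "data_scientist": [
        "guardrails and risk assessment",
        "LLM-as-judge validation",
        "labelled dataset creation",
        "fine-tuning (if required)",
    ],
    "engineer": [
        "API integration",
        "distributed systems architecture",
        "sub-agent orchestration infrastructure",
        "eval and observability pipelines",
    ],
    "domain_expert": [
        "prompt and context engineering",
        "human annotation",
        "defining good agent behaviour",
    ],
}


def audit_team(members: dict[str, str]) -> dict:
    """Flag absent role types and the responsibilities left without an owner."""
    unknown = sorted(set(members.values()) - set(RESPONSIBILITIES))
    if unknown:
        raise ValueError(f"unknown role types: {unknown}")

    counts = Counter(members.values())
    missing = [role for role in RESPONSIBILITIES if counts[role] == 0]
    unowned = [task for role in missing for task in RESPONSIBILITIES[role]]
    # A team drawn from a single role type is the red flag from the audit step.
    single_role_team = len(members) > 0 and len(counts) == 1
    return {
        "missing_roles": missing,
        "unowned": unowned,
        "single_role_team": single_role_team,
    }


# Hypothetical team made up entirely of data scientists.
team = {"Ana": "data_scientist", "Ben": "data_scientist", "Chen": "data_scientist"}
report = audit_team(team)
print(report["missing_roles"])
print(report["single_role_team"])
```

For this sample team the audit reports both the engineering and domain-expert roles as missing, so seven of the listed responsibilities have no owner.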

// What does the Hetzel framework look like in real-world scenarios?

A large financial services firm has its ML platform team assigned to build a customer-facing AI agent for loan document processing. The team has strong Python and model training skills but limited exposure to LLM APIs or customer workflows.

Classify as Traditional Enterprise — isolation mistake is likely in progress. Audit reveals: strong ML coverage, no product/systems engineers on the team, no loan officers or underwriters (domain experts) involved. Assign ML engineers to guardrails and LLM-as-judge validation. Bring in a systems engineer to handle the multi-system tool-calling architecture. Pull loan officers into prompt and context engineering sessions — they have the closest proximity to the problem. Resist the urge to fine-tune; attempt context engineering first. Implement observability from day one so production traces can be human-annotated by underwriters.

An AI-native startup of five engineers is building an autonomous scheduling agent for healthcare clinics. Everyone codes; no one has a formal data science background.

Classify as AI Native — cross-functional proximity is a strength. Gap identified: no guardrails role — no one is stress-testing the LLM's statistical limitations or validating the LLM-as-judge eval quality. Recommend bringing in a fractional data scientist or assigning one engineer to own the guardrails and eval validation function. Ensure clinic staff (domain experts) are directly editing and owning the context/prompts seeded into the scheduling agent — they know edge cases engineers will never anticipate. Build a labelled dataset from the first 200 production interactions so eval quality can be measured against human agreement.

// What mistakes should you avoid when structuring an AI agent team?

  • The Isolation Mistake: handing agent development entirely to the ML/data science team because generative AI has 'AI' in the name — this misses the systems engineering complexity and the proximity-to-problem advantage of domain experts.
  • Applying traditional ML metrics (precision, recall, F1) as the primary success criteria for the agent itself. These metrics apply to evaluators and classifiers, not to the functional performance of a full agent trace.
  • Over-trusting LLM-as-judge outputs during evals without creating a labelled dataset to validate whether the judge is actually aligned with human agreement.
  • Treating fine-tuning as the default technical contribution of data scientists on agent teams — fine-tuning is rare; the more frequent and impactful contribution is guardrails and eval quality control.
  • Gating prompt and context engineering behind technical staff, when domain experts and product managers have the highest proximity to the problem and should directly own or co-own this layer.
  • Building strong evals but no observability — confidence built in experimentation degrades rapidly once the agent meets real users without a production monitoring loop.
  • Confusing proof-of-concept velocity with production readiness — many teams are prolific at building generative AI POCs but fail to implement the eval and observability pipelines needed to bring those POCs to production safely.

// What are the key terms in the Hetzel Agent Team Composition Framework?

Agent Quality Platform
A platform category (exemplified by Braintrust) focused on maintaining confidence in agent execution through two pillars: Evals and Observability.
Evals
Evaluation processes performed during experimentation — before production — used to build confidence in how an agent will execute once deployed.
Agent Observability
Monitoring and analysis performed after an agent is in production to maintain confidence in its execution as it encounters real usage and real users.
The Model Is Already Built
Core principle that the foundational LLM training pipeline has already been executed by model providers (Anthropic, OpenAI, etc.), meaning agent teams must focus downstream of the model, not on replicating the training workflow.
Context Engineering
The practice of modifying agent behaviour by changing the prompts, context, and instructions fed to an already-trained model — the generative AI analogue to feature engineering in traditional ML.
Proximity to the Problem
The degree to which a team member understands what the end agent is actually meant to solve and how real users will interact with it. Higher proximity = higher leverage for prompt/context engineering and human annotation.
LLM as Judge
An eval technique where a language model is used to assess the quality of another model's outputs. Powerful but itself just a prompt and a model — must be validated against labelled human-agreement data.
Human Annotation Workflow
A structured process where domain experts review agent traces and label whether the agent is performing correctly and why — critical input for grounding evals in real-world quality standards.
Traditional Enterprise
An organisation with pre-existing ML/data science platform teams that typically inherits agent development top-down, often making the isolation mistake of keeping it within that existing team.
AI Native
An organisation that built its offering around agents from the start, with small cross-functional engineering teams, no legacy AI ownership structures, and high per-person proximity to the problem.
The Isolation Mistake
The error of assigning agent development exclusively to ML/data science teams because generative AI has 'AI' in the name, ignoring the systems engineering and domain expertise requirements of production agents.
Guardrails Role
The function — best filled by data scientists — of providing statistical rigour and risk awareness to agent teams, reminding them of LLM limitations and preventing over-trust in model outputs.
Broader Eval Surface Area
The recognition that agents must be evaluated on functional performance across the full agent trace, not just narrow technical metrics on a classification problem — a substantially wider scope than traditional ML evaluation.

// FREQUENTLY ASKED QUESTIONS

What is the Hetzel Agent Team Composition Framework?

The Hetzel Agent Team Composition Framework is a methodology for structuring cross-functional teams that build production-grade agentic AI applications. Developed from insights by Phil Hetzel of Braintrust, it defines three essential roles — data scientists, systems/product engineers, and domain experts — and maps each to specific responsibilities like guardrail validation, distributed infrastructure, and context engineering. It corrects the common enterprise mistake of isolating agent work within ML teams.

What is the isolation mistake in AI agent team building?

The isolation mistake is the error of assigning agent development exclusively to an ML or data science team because generative AI has 'AI' in the name. This ignores that foundational LLMs are already trained and deployed by providers like OpenAI and Anthropic. Agent work requires systems engineering for orchestration, domain expertise for context engineering, and data science for guardrails — not a single team replicating the traditional ML pipeline.

How do you build a team for an AI agent project?

Start by classifying your organization as Traditional Enterprise or AI Native, then audit your current team against three required role types: data scientists/ML engineers, product/systems engineers, and non-technical domain experts. Assign data scientists to guardrails and eval validation, engineers to distributed infrastructure and observability pipelines, and domain experts to prompt/context engineering and human annotation. Flag any missing role type as a critical gap.

How do you decide if you need fine-tuning for an AI agent?

Before assigning significant ML engineering resources to fine-tuning, confirm the use case cannot be solved via context engineering alone — most can. Fine-tuning an open-source model is rare and represents the highest-leverage technical work for ML engineers, but it should be a deliberate, scoped assignment rather than a default. Start with prompt and context optimization, measure performance, and only escalate to fine-tuning if context engineering fails.

How does the Hetzel framework compare to just having data scientists build AI agents?

Having only data scientists build agents misses two critical capabilities: systems engineering for multi-agent orchestration and API integration, and domain expertise for context engineering and functional evaluation. The Hetzel framework positions data scientists specifically where they add unique value — guardrails, LLM-as-judge validation, and statistical rigor — while engineers handle infrastructure and domain experts own the prompts and define what good agent behavior looks like.

When should I use the Hetzel Agent Team Composition Framework?

Use this framework when your organization is deciding who should own, build, or govern an agentic AI initiative — especially if a traditional ML or data science team has been handed agent work by default. It is also valuable when diagnosing why an existing agent team is struggling to move from proof-of-concept to production, or when an agent passes internal tests but degrades with real users.

What results can I expect from applying the Hetzel framework to my agent team?

You can expect faster progression from proof-of-concept to production, fewer misallocated roles, and higher agent quality in real-world use. Domain experts will catch edge cases engineers miss, data scientists will prevent eval drift by validating LLM-as-judge accuracy, and engineers will build the infrastructure for reliable orchestration and monitoring. The evals-plus-observability feedback loop ensures agent performance is sustained post-deployment.

What is context engineering in agentic AI?

Context engineering is the practice of modifying agent behavior by changing the prompts, context, and instructions fed to an already-trained model. It is the generative AI analogue to feature engineering in traditional ML. The key insight is that non-technical domain experts and product managers can perform context engineering because they have the highest proximity to the problem the agent is solving.

Why shouldn't you use precision, recall, and F1 to evaluate AI agents?

Precision, recall, and F1 measure performance on narrow classification problems, not the functional end-to-end behavior of an agent across a full trace. Using them as primary agent success criteria is a category error. These metrics do apply to evaluating the LLM-as-judge mechanism itself — checking whether the judge agrees with human labels — but the agent's overall quality must be measured against functional performance criteria defined by domain experts.

What is LLM as judge and why does it need guardrails?

LLM-as-judge is an evaluation technique where a language model assesses the quality of another model's outputs. It is powerful but is itself just a prompt and a model, meaning it can drift from human judgment over time. Data scientists add critical value by creating labelled datasets and applying precision/recall/F1 metrics to validate whether the LLM judge actually agrees with human annotators — preventing undetected eval drift.
