Question 1

What is the Hetzel Agent Team Composition Framework in simple terms?

Accepted Answer

It is a method for building the right team to create production AI agents. Instead of letting one group (usually data scientists) own everything, it splits responsibilities across three roles: data scientists handle guardrails and eval quality, engineers handle infrastructure and orchestration, and domain experts handle prompts and context engineering. The framework was articulated by Phil Hetzel of Braintrust and addresses the structural failures he observed in enterprise agent teams.

Question 2

What is proximity to the problem and why does it matter for AI agent teams?

Accepted Answer

Proximity to the problem measures how closely a team member understands the real-world task the agent is trying to solve and how end users will interact with it. It matters because in agentic AI, behavior is controlled through prompts and context, not model retraining. The person closest to the problem (often a domain expert or PM, not an ML engineer) can most effectively craft those inputs. The closer someone is to the problem, the more leverage they have over agent quality.

Question 3

What does 'the model is already built' mean for agent development teams?

Accepted Answer

It means the foundational LLM — including the entire upstream data pipeline of ingestion, training, cross-validation, and deployment — has already been completed by providers like Anthropic, OpenAI, or Mistral. Agent teams access this via an API. This fundamentally changes what skills are needed: instead of training models, teams need to focus on context engineering, systems integration, evaluation, and observability. Data scientists who try to recreate the training workflow are misallocating effort.

Question 4

How do I audit my current agent team using the Hetzel framework?

Accepted Answer

Check whether your team has all three required role types: (1) data scientists or ML engineers, (2) product, application, and systems engineers, and (3) non-technical domain experts, subject matter experts, or product managers. Flag any role type that is absent or underrepresented. A team that is 100% data scientists or 100% engineers is a red flag. Then verify each role is assigned to the right responsibilities — data scientists to guardrails, engineers to infrastructure, domain experts to context engineering.
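
The role-coverage check described above can be sketched in a few lines. This is a minimal illustration, not part of the framework itself: the role labels, the `audit_team` function, and the 15% `min_share` threshold for "underrepresented" are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical labels for the three role types named in the answer above.
REQUIRED_ROLES = ("data_science", "engineering", "domain_expert")


def audit_team(members: dict[str, str], min_share: float = 0.15) -> dict[str, str]:
    """Return 'absent', 'underrepresented', or 'ok' for each required role type.

    `members` maps a person's name to one of REQUIRED_ROLES.
    `min_share` is an assumed cutoff, not a number from the framework.
    """
    counts = Counter(members.values())
    total = len(members)
    report = {}
    for role in REQUIRED_ROLES:
        n = counts.get(role, 0)
        if n == 0:
            report[role] = "absent"
        elif n / total < min_share:
            report[role] = "underrepresented"
        else:
            report[role] = "ok"
    return report


team = {
    "Ana": "data_science",
    "Ben": "data_science",
    "Chen": "data_science",
    "Dee": "engineering",
}
print(audit_team(team))
```

For this sample team the report marks `domain_expert` as absent, which is the pattern the answer calls out. The second half of the audit, checking that each role is assigned to the right responsibilities, is a judgment call that a script like this does not cover.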

Question 5

How do I bring domain experts into prompt engineering if they're not technical?

Accepted Answer

Give domain experts direct control over, or at least significant input into, the prompts and context seeded into the agent. They should also serve as primary human annotators reviewing agent traces, because they can judge whether the agent is performing well and why. Do not gate this behind a technical intermediary. Use collaborative tools where domain experts can edit prompts, review outputs, and label agent behavior without needing to write code. Their proximity to the problem is the highest-leverage asset on the team.

Question 6

How do I set up the evals and observability feedback loop for AI agents?

Accepted Answer

Evals belong to experimentation — run them before production to build confidence. Observability belongs to production — monitor agent behavior as real users interact with it. Close the loop by continuously harvesting production data to update your offline eval dataset. Critically, create labelled human-agreement data over time so you can track whether your LLM-as-judge evaluator is drifting from real human judgment. Assign engineers to implement the pipeline and data scientists to validate eval quality.
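
One way to picture the "harvest production data into the offline eval dataset" step is the sketch below. The `Trace` record and the `harvest` function are hypothetical names invented for illustration, and the sketch assumes only human-labelled traces are promoted into the eval set; a real pipeline would sit on top of whatever tracing store the team already uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trace:
    """A single agent interaction (hypothetical schema)."""
    trace_id: str
    user_input: str
    agent_output: str
    human_label: Optional[bool] = None  # None until a human has annotated it


def harvest(eval_set: list[Trace], production: list[Trace]) -> list[Trace]:
    """Return the eval set plus any human-labelled production traces not already in it."""
    seen = {t.trace_id for t in eval_set}
    merged = list(eval_set)
    for t in production:
        if t.human_label is not None and t.trace_id not in seen:
            merged.append(t)
            seen.add(t.trace_id)
    return merged


offline = [Trace("a", "refund?", "Refund issued.", True)]
prod = [
    Trace("a", "refund?", "Refund issued.", True),         # duplicate, skipped
    Trace("b", "cancel order", "Order cancelled."),        # unlabelled, skipped
    Trace("c", "change address", "I cannot help.", False), # labelled, added
]
print([t.trace_id for t in harvest(offline, prod)])
```

Keeping unlabelled traces out of the eval set is a design choice for this sketch: it means every harvested example can also serve as human-agreement data for checking the LLM-as-judge.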

Question 7

How do I decide whether my agent use case requires fine-tuning?

Accepted Answer

Before committing data science resources to fine-tuning, confirm the use case cannot be solved via context engineering alone — most can. Fine-tuning is warranted when you need the model to learn domain-specific patterns, vocabulary, or behaviors that cannot be adequately conveyed through prompts and context. If context engineering achieves acceptable performance, skip fine-tuning entirely. When fine-tuning is needed, treat it as a deliberate, scoped assignment for ML engineers rather than a default activity.

Question 8

Why is my AI agent team stuck in proof-of-concept mode?

Accepted Answer

The most common reasons are: missing eval and observability pipelines needed for production confidence, over-reliance on data scientists doing work outside their highest-leverage zone, absence of domain experts in context engineering, and confusing POC velocity with production readiness. The Hetzel framework diagnoses this by auditing team composition against three required roles and checking whether the evals-plus-observability feedback loop exists. Many teams are prolific at building demos but lack the infrastructure to safely deploy them.

Question 9

My LLM-as-judge evals look great but the agent fails in production — what went wrong?

Accepted Answer

Your LLM-as-judge evaluator is likely drifting from actual human judgment. LLM-as-judge is itself just a prompt and a model — it can systematically disagree with how real users and domain experts assess quality. The fix is to create a labelled dataset where humans annotate agent traces, then measure precision, recall, and F1 of the judge against that ground truth. This is the specific guardrails role the Hetzel framework assigns to data scientists. Without it, you have no way to know if your evals are actually measuring quality.
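
A minimal sketch of measuring the judge against human ground truth follows. It assumes both the humans and the judge give a binary pass/fail verdict per trace; the function name `judge_metrics` and the sample labels are illustrative only.

```python
def judge_metrics(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Score an LLM judge's pass/fail verdicts against human labels.

    True means "acceptable". Human labels are treated as ground truth.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")

    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)      # both say acceptable
    fp = sum(1 for h, j in pairs if not h and j)  # judge passes what humans reject
    fn = sum(1 for h, j in pairs if h and not j)  # judge rejects what humans pass

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


human_labels = [True, True, False, False, True]
judge_labels = [True, False, True, False, True]
print(judge_metrics(human_labels, judge_labels))
```

Low precision here means the judge is passing traces that humans reject, which is the failure mode most likely to produce "great evals, bad production".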

Question 10

Our data science team says they should own the agent project because they have AI expertise — how do I push back?

Accepted Answer

Acknowledge their expertise is essential but reframe their role. The foundational model is already built — the team's job is downstream. Data scientists add unique value in guardrails, risk assessment, and validating eval quality with labelled datasets. But agent development also requires systems engineering (distributed orchestration, infra, API integration) and context engineering (best done by domain experts with proximity to the problem). Present the Hetzel framework's three-role model as an upgrade that amplifies their contribution, not a demotion.

Question 11

What happens if I skip observability and only run evals before deployment?

Accepted Answer

Your confidence degrades immediately once the agent faces real users. Evals build confidence during experimentation, but production introduces distribution shifts, novel edge cases, and user behaviors your eval set never covered. Without observability, you have no mechanism to detect when the agent starts failing or drifting. The Hetzel framework treats evals and observability as two inseparable pillars — without both, agents that pass internal tests will degrade in the wild and you won't know until users complain.

Question 12

How does the Hetzel framework compare to a generic agile team structure for AI projects?

Accepted Answer

Generic agile structures assign roles like product owner, scrum master, and developers without specifying the responsibilities that are specific to agentic AI. The Hetzel framework goes further by identifying three specific role types (data scientists, engineers, domain experts), mapping each to concrete agent-specific responsibilities (guardrails, orchestration, context engineering), and warning against the Isolation Mistake: leaving agent work inside a single isolated team, typically data science. It also introduces the evals-plus-observability feedback loop as a structural requirement, which generic agile does not address.

Question 13

How is the Hetzel framework different from MLOps team structures?

Accepted Answer

MLOps team structures are designed around the traditional ML lifecycle: data ingestion, feature engineering, model training, deployment, and monitoring of model drift. The Hetzel framework starts from the premise that the model is already built and deployed via API. The focus shifts to context engineering instead of feature engineering, functional agent evaluation instead of narrow classification metrics, and a broader team that includes non-technical domain experts as first-class contributors — not just consumers of model outputs.

Question 14

Can I use the Hetzel framework for a team building RAG applications, not just agents?

Accepted Answer

Yes — the core principles apply to any LLM-powered application where the model is accessed via API. RAG applications still require context engineering (retrieval strategy and prompt design), systems engineering (vector databases, retrieval pipelines), and domain expertise (judging retrieval quality and answer correctness). The framework's emphasis on proximity to the problem and the evals-plus-observability feedback loop is equally relevant. The main difference is that agentic applications add orchestration complexity, making the systems engineering role even more critical.

Question 15

What metrics should I use to evaluate AI agent performance instead of precision and recall?

Accepted Answer

Define functional performance criteria — what the agent needs to do correctly, end-to-end, for a real user. These are task-specific: did the agent correctly process the loan document, schedule the right appointment, or resolve the customer issue? Domain experts should define these criteria. Traditional precision, recall, and F1 still have a place — but they apply to your evaluators (like LLM-as-judge), not to the agent's overall behavior. The Hetzel framework calls this the Broader Eval Surface Area principle.
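
Functional criteria can be written down as named, task-level checks that a domain expert could read and sign off on. The sketch below uses the loan-document example from the answer; the field names (`loan_amount`, `applicant`, `expected_applicant`) and the two criteria are hypothetical and stand in for whatever the domain experts actually define.

```python
from typing import Callable

# A criterion takes the agent's structured output and returns pass/fail.
Criterion = Callable[[dict], bool]

# Hypothetical criteria for a loan-document agent.
CRITERIA: dict[str, Criterion] = {
    "amount_extracted": lambda out: out.get("loan_amount") is not None,
    "applicant_matches": lambda out: out.get("applicant") == out.get("expected_applicant"),
}


def evaluate(output: dict) -> dict[str, bool]:
    """Run every functional criterion against one agent output."""
    return {name: check(output) for name, check in CRITERIA.items()}


def task_passed(output: dict) -> bool:
    """The task counts as done only if every criterion passes."""
    return all(evaluate(output).values())


good = {"loan_amount": 25000, "applicant": "J. Doe", "expected_applicant": "J. Doe"}
bad = {"applicant": "J. Doe", "expected_applicant": "J. Doe"}
print(evaluate(good), task_passed(good))
print(evaluate(bad), task_passed(bad))
```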

Question 16

How should an AI-native startup apply the Hetzel framework differently from an enterprise?

Accepted Answer

AI-native startups typically have small cross-functional engineering teams with high proximity to the problem — their structure is already closer to the ideal. The main gap is usually the guardrails role: no one is stress-testing the LLM's statistical limitations or validating LLM-as-judge quality. The fix is to bring in a fractional data scientist or assign one engineer to own eval validation. Enterprises face the opposite problem — they need to break agents out of the isolated ML team and bring in domain experts and systems engineers.

Question 17

What is the guardrails role and who should fill it?

Accepted Answer

The guardrails role is the function of providing statistical rigor and risk awareness to an agent team. This includes reminding the team that LLMs are probabilistic token predictors, validating LLM-as-judge eval quality using labelled datasets and precision/recall/F1, and preventing over-trust in model outputs. Data scientists are best suited for this role because of their training in statistical methodology. In AI-native startups without a data scientist, one engineer should be explicitly assigned to own this function.

Question 18

How many people do I need on an agentic AI team?

Accepted Answer

The Hetzel framework does not prescribe a specific team size — it prescribes role coverage. You need representation from three role types: data scientists, product/systems engineers, and domain experts. A five-person AI-native startup can cover all three if roles are explicitly assigned. A large enterprise might have dozens of people but still fail if all of them are data scientists. The key is deliberate diversity of roles, not headcount. Even fractional or part-time coverage of a missing role type is better than none.

Question 19

Should product managers be on an AI agent development team?

Accepted Answer

Yes — product managers are one of the highest-leverage roles on an agent team. They have proximity to the problem, understand user behavior and intent, and can define what 'good' agent behavior looks like. The Hetzel framework positions PMs as co-owners of context engineering alongside domain experts. They should directly edit prompts and context, participate in human annotation of agent traces, and define the functional performance criteria that the agent will be evaluated against. Do not relegate them to a stakeholder-only role.

Question 20

What is an Agent Quality Platform?

Accepted Answer

An Agent Quality Platform is a category of tooling focused on maintaining confidence in agent execution through two pillars: Evals (pre-production evaluation to build confidence) and Observability (production monitoring to maintain confidence). Braintrust, the company Phil Hetzel represents, exemplifies this category. The platform closes the feedback loop by enabling teams to harvest production data back into offline eval datasets, track LLM-as-judge alignment with human judgment, and continuously improve agent quality.

Question 21

How do I prevent eval drift in my AI agent system?

Accepted Answer

Eval drift occurs when your LLM-as-judge evaluator gradually diverges from actual human judgment without anyone noticing. Prevent it by continuously adding grounded, human-labelled data to your eval dataset. Assign domain experts to annotate production agent traces, then have data scientists measure whether the LLM judge's assessments still align with those human labels using precision, recall, and F1. This is the core of the Hetzel framework's evals-plus-observability feedback loop. Without this mechanism, eval quality silently degrades over time.
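
To watch for drift over time, the human-versus-judge comparison can be repeated per time window and compared with a baseline. The sketch below uses plain agreement rate as a simpler stand-in for the precision/recall/F1 measures named above; the period names, the baseline, and the 0.05 tolerance are assumptions chosen for illustration.

```python
def agreement_rate(pairs: list[tuple[bool, bool]]) -> float:
    """Fraction of traces where the human label and the judge verdict match."""
    if not pairs:
        raise ValueError("need at least one (human, judge) pair")
    return sum(1 for h, j in pairs if h == j) / len(pairs)


def drifting_periods(
    by_period: dict[str, list[tuple[bool, bool]]],
    baseline: float,
    tolerance: float = 0.05,
) -> list[str]:
    """Return the periods whose agreement fell more than `tolerance` below baseline."""
    return [
        period
        for period, pairs in by_period.items()
        if agreement_rate(pairs) < baseline - tolerance
    ]


# Each pair is (human_label, judge_verdict) for one annotated production trace.
history = {
    "week_1": [(True, True), (False, False), (True, True), (True, True)],
    "week_2": [(True, False), (False, True), (True, True), (False, False)],
}
print(drifting_periods(history, baseline=0.9))
```

In this sample, `week_2` agreement drops to 0.5 and is flagged, while `week_1` stays at 1.0.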

Question 22

Can non-technical people really contribute to building AI agents?

Accepted Answer

Yes — non-technical domain experts are among the most valuable contributors on an agent team. Since agent behavior is controlled through context engineering (prompts, instructions, and context), not model retraining, people who deeply understand the problem domain can directly shape how the agent performs. They also serve as the best human annotators because they can judge not just whether the agent succeeded, but why it succeeded or failed. The Hetzel framework treats gating this behind technical intermediaries as a key pitfall.

Question 23

What is the difference between context engineering and prompt engineering?

Accepted Answer

Context engineering is the broader practice of modifying all inputs to an already-trained model — prompts, system instructions, retrieved context, tool definitions, conversation history, and any other information seeded into the model's context window. Prompt engineering is typically narrower, focusing on the specific user or system prompt. The Hetzel framework uses 'context engineering' deliberately to emphasize that the full input surface — not just the prompt template — is the lever for controlling agent behavior, and domain experts should own this layer.
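
The distinction is easier to see when the full input surface is laid out as a data structure. In the sketch below, the `AgentContext` class, its field names, and the bracketed section markers are hypothetical; the point is only that the user prompt is one field among several that end up in the context window.

```python
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    """Everything seeded into the model's context window (illustrative schema)."""
    system_instructions: str
    user_prompt: str  # the part that prompt engineering usually focuses on
    retrieved_documents: list[str] = field(default_factory=list)
    tool_definitions: list[str] = field(default_factory=list)
    conversation_history: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Join the non-empty sections into one text block."""
        sections = [
            ("SYSTEM", [self.system_instructions]),
            ("TOOLS", self.tool_definitions),
            ("CONTEXT", self.retrieved_documents),
            ("HISTORY", self.conversation_history),
            ("USER", [self.user_prompt]),
        ]
        return "\n\n".join(
            f"[{name}]\n" + "\n".join(items) for name, items in sections if items
        )


ctx = AgentContext(
    system_instructions="Be concise.",
    user_prompt="Summarize the policy.",
    retrieved_documents=["Policy v2: refunds within 30 days."],
)
print(ctx.render())
```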

Frequently Asked Questions About Hetzel Agent Team Composition Framework
