Frequently Asked Questions About the Hetzel Agent Team Composition Framework

21 answers covering everything from basics to advanced usage.

// Basics

What is the difference between evals and observability for AI agents?

Evals are evaluation processes performed during experimentation before production — they build confidence in how an agent will behave once deployed. Observability is monitoring and analysis performed after the agent is in production — it maintains confidence as the agent encounters real users. The two form a feedback loop: production data from observability is harvested to update the offline eval dataset, keeping evaluations grounded in real-world performance.

What does proximity to the problem mean in agent team design?

Proximity to the problem measures how closely a team member understands the real-world task the agent is solving and how end users will interact with it. Domain experts, product managers, and subject matter experts typically have the highest proximity. In agentic AI, higher proximity translates to higher leverage because these individuals can engineer better prompts and context, spot edge cases, and judge agent behavior more accurately than someone disconnected from the use case.

Why does it matter for agent teams that the foundational model is already built?

Because providers such as Anthropic and OpenAI have already trained the foundational LLM, the entire upstream ML pipeline (data ingestion, training, cross-validation, deployment) is handled externally. Agent teams do not need to replicate that workflow. This fundamentally changes the skill requirements: the highest-leverage work shifts to context engineering, systems integration, orchestration, and evaluation rather than model training and feature engineering.

What is the guardrails role and who should fill it?

The guardrails role provides statistical rigor and risk awareness to an agent team: it reminds everyone that the LLM is predicting tokens rather than possessing knowledge. Data scientists are best suited for this role because they understand probability, can stress-test outputs against edge cases, and can measure whether evaluation mechanisms are statistically sound. Without this role, teams tend to over-trust agent outputs.

// How To

How do I audit my current agent team composition using the Hetzel framework?

Check whether your team includes all three required role types: (1) data scientists or ML engineers, (2) product, application, or systems engineers, and (3) non-technical domain experts, subject matter experts, or product managers. Flag any role type that is absent or underrepresented. A team composed entirely of data scientists or entirely of engineers is a red flag. Then verify each role is assigned to its highest-leverage responsibilities rather than doing generic agent development.
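
The first part of this audit can be sketched in a few lines of Python. This is only an illustration: the job titles in ROLE_TYPES and the function name audit_team are assumptions made for the example, not part of the framework, and the sketch checks only for absent role types, not underrepresented ones.

```python
# Hypothetical mapping from job titles to the three role types named above.
ROLE_TYPES = {
    "data_science": {"data scientist", "ml engineer"},
    "engineering": {"product engineer", "application engineer", "systems engineer"},
    "domain": {"domain expert", "subject matter expert", "product manager"},
}


def audit_team(titles):
    """Count members per role type and flag absent or single-type teams."""
    counts = {role_type: 0 for role_type in ROLE_TYPES}
    for title in titles:
        for role_type, known_titles in ROLE_TYPES.items():
            if title.lower() in known_titles:
                counts[role_type] += 1

    # A role type with zero members is flagged as missing.
    flags = [f"missing role type: {rt}" for rt, n in counts.items() if n == 0]

    # A team drawn from a single role type is the red flag described above.
    covered = [rt for rt, n in counts.items() if n > 0]
    if len(covered) == 1:
        flags.append(f"team is composed entirely of {covered[0]}")
    return counts, flags


if __name__ == "__main__":
    counts, flags = audit_team(["Data Scientist", "ML Engineer"])
    print(counts)
    for flag in flags:
        print(flag)
```

The second part of the audit, checking that each role is doing its highest-leverage work, is a judgment call that a script like this cannot make.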

How do I bring domain experts into prompt engineering if they aren't technical?

Give domain experts direct access to edit or co-own the prompts and context seeded into the agent — do not gate this behind a technical intermediary. They should also serve as primary human annotators reviewing agent traces, because they can judge both whether the agent performed correctly and why. Use collaborative tools where domain experts can modify instructions in natural language while engineers handle the underlying architecture and integration.

How do I set up the evals plus observability feedback loop for agents?

Start by building offline evals that test agent behavior against defined functional criteria before deployment. Then implement observability to monitor production traces in real time. Close the loop by continuously harvesting production data and adding human-labelled annotations to your offline eval dataset. Track whether your LLM-as-judge evaluator maintains agreement with human judgment over time. If alignment drifts, update the judge's prompts or labelled dataset to correct it.
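
As a rough sketch of the "close the loop" step, the code below moves human-labelled production traces into an offline eval dataset and reports how often the judge agrees with the human label. The Trace fields and the function names harvest and judge_agreement are hypothetical; a real setup would read traces from whatever observability store the team uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    trace_id: str
    agent_output: str
    judge_pass: bool                   # verdict from the LLM-as-judge evaluator
    human_pass: Optional[bool] = None  # set only once a human has annotated the trace


def harvest(production_traces, eval_dataset):
    """Append human-labelled production traces that are not yet in the eval dataset."""
    known_ids = {t.trace_id for t in eval_dataset}
    added = 0
    for trace in production_traces:
        if trace.human_pass is not None and trace.trace_id not in known_ids:
            eval_dataset.append(trace)
            known_ids.add(trace.trace_id)
            added += 1
    return added


def judge_agreement(eval_dataset):
    """Fraction of human-labelled traces where the judge matches the human verdict."""
    labelled = [t for t in eval_dataset if t.human_pass is not None]
    if not labelled:
        return None  # nothing to compare against yet
    matches = sum(t.judge_pass == t.human_pass for t in labelled)
    return matches / len(labelled)
```

Running harvest on a schedule and recording judge_agreement after each run gives a simple time series to watch for the alignment drift mentioned above.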

How do I assign data scientists to the right tasks on an agent team?

Data scientists should own three specific functions: (1) guardrails and risk assessment, which includes reminding the team that the LLM predicts tokens rather than possessing knowledge; (2) LLM-as-judge validation, which means creating labelled datasets and measuring the judge's precision, recall, and F1 against human judgment; (3) fine-tuning, if and only if context engineering alone is insufficient. Do not ask them to generically 'own agents' or to do systems engineering or prompt engineering. That is a misallocation of their statistical expertise.
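
For the LLM-as-judge validation task, the measurement itself is small. The sketch below treats the human label as ground truth and "acceptable output" as the positive class; both choices, along with the function name judge_metrics, are assumptions made for illustration.

```python
def judge_metrics(human_labels, judge_labels):
    """Precision, recall, and F1 of the judge, with human labels as ground truth.

    Both arguments are equal-length lists of booleans where True means
    "the agent output was acceptable" (the positive class, by assumption).
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)        # both say acceptable
    fp = sum(1 for h, j in pairs if not h and j)    # judge passes, human rejects
    fn = sum(1 for h, j in pairs if h and not j)    # judge rejects, human passes

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    human = [True, True, False, False, True]
    judge = [True, False, True, False, True]
    print(judge_metrics(human, judge))
```

Low precision here means the judge is passing outputs that humans reject, which is the over-trust failure the guardrails role exists to catch.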

// Troubleshooting

My agent works great in testing but fails in production — what went wrong?

You likely have strong evals but no observability. Confidence built during experimentation degrades rapidly once an agent faces real users without a production monitoring loop. Implement observability to track agent traces in production, assign domain experts to review failing traces, and feed production data back into your eval dataset. Also check whether your LLM-as-judge is drifting — validate it against a fresh labelled dataset to confirm it still agrees with human judgment.

Our data science team was assigned to build agents and they're stuck — why?

This is the classic isolation mistake. Data scientists specialize in model training, feature engineering, and statistical analysis, but agent development also requires systems engineering (multi-agent orchestration, API integration, and infrastructure) and domain expertise (context engineering and functional evaluation). The fix is not to remove data scientists but to augment the team with product/systems engineers and domain experts, then reassign data scientists to their highest-leverage work: guardrails and eval validation.

Why is my LLM-as-judge evaluation drifting over time?

LLM-as-judge is itself just a prompt and a model — its assessments can shift as underlying model behavior changes or as the distribution of agent outputs evolves. Without a labelled human-agreement dataset to benchmark against, you have no way to detect this drift. The fix is to assign data scientists to regularly compare the judge's ratings against fresh human annotations using precision, recall, and F1 metrics, then update the judge's prompts or reference data as needed.
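
One simple way to make drift visible is to compare each period's judge/human agreement against a baseline. The sketch below uses plain agreement rate instead of precision, recall, and F1 to stay short; the function name detect_drift and the 0.05 tolerance are illustrative assumptions, not values the framework prescribes.

```python
def detect_drift(baseline_agreement, periods, tolerance=0.05):
    """Return (label, agreement) for periods that fall below baseline minus tolerance.

    `periods` is a list of (label, pairs) tuples, where `pairs` is a list of
    (human_verdict, judge_verdict) booleans from fresh human annotations.
    """
    drifted = []
    for label, pairs in periods:
        if not pairs:
            continue  # no fresh annotations for this period, nothing to compare
        agreement = sum(1 for human, judge in pairs if human == judge) / len(pairs)
        if agreement < baseline_agreement - tolerance:
            drifted.append((label, agreement))
    return drifted


if __name__ == "__main__":
    periods = [
        ("week 1", [(True, True)] * 9 + [(True, False)]),
        ("week 2", [(True, True)] * 7 + [(False, True)] * 3),
    ]
    print(detect_drift(0.9, periods))
```

A flagged period is a prompt to investigate, not proof of drift: with small annotation samples, agreement can dip by chance, which is another reason to have a data scientist own this check.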

How do traditional enterprises typically get agent team structure wrong?

Traditional enterprises most commonly make the isolation mistake: assigning agent development to their existing ML or data science platform team because generative AI has 'AI' in the name. This team then tries to replicate the traditional ML pipeline — data ingestion, training, deployment — when the foundational model is already built. The fix is to restructure the team to include systems engineers for orchestration and domain experts for context engineering, while refocusing data scientists on guardrails and eval validation.

How do I prevent POC agents from stalling before production?

Many teams are prolific at building generative AI proof-of-concepts but fail to implement the eval and observability pipelines needed for production. The Hetzel framework addresses this by requiring both pillars from the start. Assign engineers to build eval infrastructure and observability from day one, not as an afterthought. Assign domain experts to define functional success criteria and begin human annotation early. This converts POC velocity into production readiness rather than a dead-end demo.

// Comparisons

How does the Hetzel framework compare to just using a traditional ML team structure for AI?

Traditional ML team structures center on the training pipeline: data ingestion, feature engineering, model training, cross-validation, and deployment. The Hetzel framework recognizes that for agentic AI, that pipeline is already done by model providers. It replaces the training-centric structure with a three-role model focused on context engineering (domain experts), systems orchestration (engineers), and eval validation (data scientists). This avoids the isolation mistake of treating agents like another ML model to train.

How is context engineering different from feature engineering?

Feature engineering modifies model behavior by transforming input data and retraining the model. Context engineering modifies agent behavior by changing the prompts, instructions, and context fed to an already-trained model — no retraining required. This is a fundamental shift: the highest-leverage skill moves from statistical data transformation to understanding the problem domain well enough to write better instructions. Domain experts and product managers can do context engineering; feature engineering requires data scientists.

Hetzel framework vs generic cross-functional team advice — what's different?

Generic cross-functional team advice says 'include diverse roles.' The Hetzel framework prescribes exactly which roles own which responsibilities in agentic AI specifically. It maps data scientists to guardrails and LLM-as-judge validation, engineers to distributed orchestration and observability pipelines, and domain experts to context engineering and human annotation. It also defines the evals-plus-observability feedback loop as a structural requirement and explicitly names the isolation mistake as the primary anti-pattern to avoid.

// Advanced

Can a small startup with no data scientists use the Hetzel framework?

Yes, but with adaptation. AI-native startups often have strong engineering proximity to the product but lack the guardrails role. The framework recommends hiring a fractional data scientist or assigning one engineer to own eval validation and statistical rigor — specifically, creating labelled datasets and validating LLM-as-judge accuracy. The critical step is ensuring domain experts (e.g., end users or industry specialists) are directly involved in context engineering and annotation, not just consulted occasionally.

How do I know if my agent use case needs fine-tuning or just better prompts?

Start with context engineering — refining prompts, system instructions, and contextual information fed to the model. Measure functional performance against criteria defined by domain experts. If the agent consistently fails on specific tasks despite optimized context, and those failures trace to the model's base capabilities rather than input quality, then fine-tuning may be warranted. Fine-tuning should be a deliberate escalation, not a default starting point, because most agent use cases can be solved at the context layer.

What does the broader eval surface area mean for agent testing?

Agents must be evaluated across the full agent trace — including tool calls, sub-agent orchestration, multi-turn reasoning, and final output — not just on a narrow classification metric. Traditional ML evaluation focuses on precision, recall, and F1 for a single prediction. Agent evaluation requires functional criteria: did the agent do the right thing end-to-end for the user? Domain experts define these criteria; data scientists validate the evaluation mechanism used to measure them.
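
To make "functional criteria over the full trace" concrete, here is a minimal sketch. The AgentTrace fields, the refund scenario, the tool names lookup_order and issue_refund, and the turn budget of 6 are all hypothetical; in practice domain experts would define the checks.

```python
from dataclasses import dataclass


@dataclass
class AgentTrace:
    tool_calls: list    # tool names in the order the agent called them
    turns: int          # number of reasoning/conversation turns used
    final_output: str   # what the user finally saw


def verified_before_refund(trace):
    """If a refund was issued, the order must have been looked up first."""
    calls = trace.tool_calls
    if "issue_refund" not in calls:
        return True  # no refund issued, so there is nothing to verify
    return "lookup_order" in calls and calls.index("lookup_order") < calls.index("issue_refund")


def finished_within_turn_budget(trace, max_turns=6):
    return trace.turns <= max_turns


def gave_a_final_answer(trace):
    return bool(trace.final_output.strip())


# Each criterion inspects a different part of the trace, not just the final output.
CRITERIA = {
    "verified_before_refund": verified_before_refund,
    "finished_within_turn_budget": finished_within_turn_budget,
    "gave_a_final_answer": gave_a_final_answer,
}


def evaluate_trace(trace, criteria=CRITERIA):
    """Run every functional check against the whole trace."""
    return {name: check(trace) for name, check in criteria.items()}
```

A trace can produce a plausible final answer and still fail, for example by issuing a refund without looking up the order, which a final-output-only metric would miss.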

Should domain experts or engineers write the prompts for AI agents?

Domain experts should write or co-own the prompts because they have the highest proximity to the problem the agent is solving. They understand real user behavior, edge cases, and what constitutes good agent output. Engineers should handle the infrastructure that delivers those prompts — templating systems, context injection, and orchestration — but the actual content of the instructions should be driven by the people who understand the domain, not gated behind technical intermediaries.
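
The split of ownership can be sketched with Python's standard string.Template: the instruction text is the part domain experts edit, and the rendering function is the part engineers maintain. The instruction wording, the variable names, and build_prompt are all illustrative assumptions.

```python
from string import Template

# Owned by domain experts: plain-language instructions with $placeholders.
# In practice this text might live in a shared document or a prompt store.
DOMAIN_INSTRUCTIONS = Template(
    "You are assisting a $audience.\n"
    "Always confirm $must_confirm before acting.\n"
)


def build_prompt(instructions, variables, retrieved_context):
    """Owned by engineers: fill placeholders and inject retrieved context."""
    # substitute() raises KeyError if a placeholder has no value, so a
    # missing variable fails loudly instead of reaching the model unfilled.
    body = instructions.substitute(variables)
    context = "\n".join(f"- {item}" for item in retrieved_context)
    return f"{body}\nContext:\n{context}"


if __name__ == "__main__":
    prompt = build_prompt(
        DOMAIN_INSTRUCTIONS,
        {"audience": "claims adjuster", "must_confirm": "the policy number"},
        ["Policy is active"],
    )
    print(prompt)
```

With this separation, a domain expert can change the wording of the instructions without touching build_prompt, and an engineer can change how context is injected without rewriting the instructions.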

How does the Hetzel framework apply to multi-agent architectures?

Multi-agent architectures — where a supervisor agent calls child or sub-agents running on different infrastructure — are fundamentally distributed systems problems, not statistics problems. The Hetzel framework assigns systems and product engineers to own this orchestration layer. Data scientists focus on validating the eval quality for each sub-agent and the overall trace. Domain experts define what correct end-to-end behavior looks like across the full chain of agent interactions.
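
A toy version of the orchestration layer shows why this is engineering work. In the sketch below, sub-agents are plain Python callables standing in for remote services, and run_supervisor records one span per call so the full chain can be evaluated later. The names, the span fields, and the stop-on-first-error policy are assumptions for illustration.

```python
import time


def run_supervisor(task, sub_agents, plan):
    """Call each sub-agent named in `plan` in order, recording one span per call.

    `sub_agents` maps a name to a callable that takes and returns a payload.
    Returns the last successful payload and the list of spans (the trace).
    """
    trace = []
    payload = task
    for name in plan:
        start = time.perf_counter()
        try:
            payload = sub_agents[name](payload)
            status = "ok"
        except Exception as exc:  # a failing sub-agent must not crash the supervisor
            status = f"error: {exc}"
        trace.append({
            "agent": name,
            "status": status,
            "seconds": time.perf_counter() - start,
        })
        if status != "ok":
            break  # assumed policy: stop the chain on the first failure
    return payload, trace


if __name__ == "__main__":
    agents = {
        "retrieve": lambda text: text + " [docs]",
        "draft": lambda text: text.upper(),
    }
    result, trace = run_supervisor("q", agents, ["retrieve", "draft"])
    print(result)
    for span in trace:
        print(span["agent"], span["status"])
```

The spans are what the other two roles consume: data scientists validate evals per sub-agent and over the whole trace, and domain experts judge whether the chain as a whole did the right thing.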