How Should Enterprise Leaders Staff Agentic AI Teams?
For Enterprise AI leaders and VPs of Engineering · Based on the Hetzel Agent Team Composition Framework
// TL;DR
Enterprise AI leaders face a specific trap: handing agent development to the existing ML or data science team because 'it has AI in the name.' The Hetzel Agent Team Composition Framework gives you a diagnostic to restructure your team around three essential personas — data scientists for guardrails, product engineers for infrastructure, and domain experts for context engineering. Use it when your executive team is allocating headcount for agentic AI, when an existing ML team is struggling with agent quality, or when agent output isn't matching user needs despite strong technical metrics.
Why does my ML team's agent project keep missing the mark?
The most common enterprise mistake in agentic AI is delegating agent development entirely to the ML or data science platform team. This feels logical (agents are AI, and your ML team does AI), but it creates a fundamental mismatch. In agentic AI, unlike traditional ML, the model has already been built by a provider such as Anthropic, OpenAI, or Mistral. Your team's job is to implement, evaluate, and contextualise that model. This shifts the most valuable skills away from model training and toward context engineering, functional evaluation, and systems integration.
The Hetzel Agent Team Composition Framework classifies your organisation as a Traditional Enterprise and identifies your default risk: over-indexing on ML metrics (precision, recall, F1) while under-including non-technical domain experts who actually understand the problem the agent is meant to solve.
How should I restructure my agent team as an enterprise leader?
Start by auditing your current team against three personas:
1. Data Scientists / ML Engineers — Redirect them from model training to their highest-value agent role: acting as the 'adult in the room' on LLM risk, validating LLM-as-judge evals against labelled datasets, and leading fine-tuning only when genuinely required.
2. Product / Application / Systems Engineers: They manage the LLM-as-API integration, build the infrastructure for distributed agent architectures (for example, a supervisor and sub-agents running on separate compute), and own the eval and observability pipelines.
3. Non-Technical Domain Experts / SMEs — This is where most enterprises fail. Customer service specialists, compliance officers, legal analysts — whoever understands the problem the agent solves — must have meaningful control over prompt and context engineering. They must also participate in human annotation workflows, reviewing agent traces and labelling performance with reasoning.
If any persona is missing entirely, that is a structural gap you must fill before expecting production-quality agent output.
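The audit above can be sketched as a simple coverage check. This is a minimal illustration, not part of the Hetzel framework itself: the `TeamMember` class, the persona keys, and the `owns_context_engineering` flag are hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Illustrative keys for the three personas described above (names are assumptions).
PERSONAS = ("data_scientist", "product_engineer", "domain_expert")


@dataclass
class TeamMember:
    name: str
    persona: str  # expected to be one of PERSONAS
    owns_context_engineering: bool = False  # can this person edit prompts/context?


def audit_team(members):
    """Return a list of structural gaps for a proposed agent team."""
    gaps = []
    present = {m.persona for m in members}
    for persona in PERSONAS:
        if persona not in present:
            gaps.append(f"missing persona: {persona}")

    # Domain experts should have meaningful control over prompts and context.
    experts = [m for m in members if m.persona == "domain_expert"]
    if experts and not any(m.owns_context_engineering for m in experts):
        gaps.append("domain experts present but lack control over context engineering")
    return gaps


# Example: a team staffed only from ML and engineering.
team = [
    TeamMember("A", "data_scientist"),
    TeamMember("B", "product_engineer"),
]
print(audit_team(team))
```

A real audit would involve judgment about how much control each person actually has. The point of the sketch is only that a missing persona is a yes/no structural question you can answer before the project starts.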
What eval and observability changes should I mandate?
Stop treating precision, recall, and F1 as the primary eval signal for agents. These technical metrics suit two-box ML classification pipelines, not the broad functional surface area of agentic behaviour. Instead, mandate functional performance evaluation: does the agent actually resolve the customer's query, produce a correct analysis, or complete the workflow safely?
Require two pipelines:
- Evals (pre-production): Build confidence before deployment. Use LLM-as-judge assessments validated against human-labelled ground truth.
- Observability (post-production): Monitor agent behaviour with real users. Feed production data back into the offline eval dataset continuously.
These pipelines must be defined jointly by all three personas, not owned exclusively by engineers.
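To make "validated against human-labelled ground truth" concrete, here is a minimal sketch of how a team might check an LLM judge against human annotations. It assumes both the judge and the human reviewers produce a simple pass/fail verdict per agent trace. The function name `judge_alignment` and the sample labels are hypothetical and chosen only for illustration.

```python
def judge_alignment(human_labels, judge_labels):
    """Precision, recall and F1 of the judge's 'pass' verdicts against human labels.

    Both arguments are lists of booleans (True = trace judged acceptable),
    aligned by index so that position i refers to the same agent trace.
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")

    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)        # both say pass
    fp = sum(1 for h, j in pairs if not h and j)    # judge passes, human fails
    fn = sum(1 for h, j in pairs if h and not j)    # judge fails, human passes

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical annotations for six agent traces.
human = [True, True, False, True, False, False]
judge = [True, False, False, True, True, False]
scores = judge_alignment(human, judge)
print(scores)
```

Here the classification metrics measure the judge, not the agent. A judge with low precision is passing traces that humans would reject, which is usually the more dangerous failure before deployment.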
What should I do next?
Run the Hetzel framework's 7-step workflow on your current agent initiative this week. Classify your org type, audit your team, assign persona-specific roles, and pressure-test against the 'Proximity to the Problem' principle. If your team has no domain expert involved, fix that before your next sprint.
// FREQUENTLY ASKED QUESTIONS
Why shouldn't I just let my ML team handle our agent project?
Because the model is already built. Your ML team's traditional skills (model training, feature engineering, cross-validation) are not the primary value drivers in agentic AI. Their highest-value contributions are guardrails, eval validation, and statistical literacy. You also need product engineers for infrastructure and domain experts for context engineering. A team of only ML engineers will optimise for the wrong metrics.
How do I convince my CTO that domain experts need to be on the agent team?
Frame it around the 'Proximity to the Problem' principle: the primary lever for changing agent behaviour is context engineering — adjusting prompts and inputs — not retraining. Domain experts are the best context engineers because they understand what the agent is actually meant to accomplish. Without them, you get technically fluent agents that miss critical domain requirements. Show examples of agent failures that stem from missing contextual grounding.
What metrics should I track instead of precision and recall for my agents?
Track functional performance: does the agent resolve the user's actual problem correctly and safely? Define success criteria with domain experts, not just engineers. Use precision, recall, and F1 specifically to validate LLM-as-judge alignment with human labels — not as the primary measure of agent quality. Also track observability metrics like production failure patterns, edge case frequency, and human-agent agreement drift over time.
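One way to watch human-agent agreement drift is to compare, per time period, how often human reviewers and the automated judge reach the same verdict on sampled production traces. The sketch below assumes pass/fail verdicts and a fixed baseline agreement rate. The function names, the period labels, the baseline of 0.9, and the tolerance of 0.1 are all illustrative assumptions, not values from the framework.

```python
from collections import defaultdict


def agreement_by_period(records):
    """Agreement rate between human and judge verdicts, grouped by period.

    `records` is an iterable of (period, human_label, judge_label) tuples.
    Returns a dict mapping each period to the fraction of matching verdicts.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for period, human, judge in records:
        totals[period] += 1
        if human == judge:
            matches[period] += 1
    return {p: matches[p] / totals[p] for p in sorted(totals)}


def drift_alerts(rates, baseline, tolerance=0.1):
    """Periods whose agreement rate falls more than `tolerance` below baseline."""
    return [p for p, rate in rates.items() if baseline - rate > tolerance]


# Hypothetical sampled production traces reviewed by humans.
records = [
    ("week-01", True, True),
    ("week-01", False, False),
    ("week-02", True, False),
    ("week-02", True, True),
    ("week-02", False, True),
    ("week-02", False, False),
]
rates = agreement_by_period(records)
alerts = drift_alerts(rates, baseline=0.9)
print(rates, alerts)
```

A drop in agreement does not say who is wrong. It signals that domain experts should review recent traces and that the judge may need revalidating against fresh human labels.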