How Should Enterprise ML Teams Staff Agent Projects?

For Enterprise AI/ML engineering leads · Based on Hetzel Agent Team Composition Framework

// TL;DR

The Hetzel Agent Team Composition Framework helps enterprise ML engineering leads restructure their teams for agentic AI. If your ML platform team has been handed agent-building responsibility because 'it has AI in the name,' you're likely over-indexing on training metrics and under-involving domain experts. The framework shows you how to redirect your ML engineers to eval validation and guardrails, bring in product engineers for API integration and infrastructure, and give domain experts real ownership of context engineering and human annotation.

Why does my ML team struggle with agentic AI projects?

Your ML team is optimised for a workflow that doesn't apply to agents. Traditional ML involves building data pipelines, training models, validating with cross-validation, and deploying. In agentic AI, the model is already built — by Anthropic, OpenAI, Mistral, or others. Your team's job is to implement, evaluate, and contextualise it.

This is the core insight of the Hetzel Agent Team Composition Framework. When Phil Hetzel diagnoses why enterprise agent projects stall, he finds the root cause is almost always the same: the organisation classified agent-building as an ML problem and staffed it accordingly. The result is a team optimising for precision, recall, and F1 on what is actually a broad functional performance challenge.

Your ML engineers are not the problem — the staffing model is.

What roles should I reassign on my ML team for agent development?

The Hetzel framework maps every team member against three personas and assigns each a specific role:

Data Scientists / ML Engineers should stop trying to train models and instead own three things: (1) acting as the 'adult in the room' on LLM risk and statistical literacy, (2) validating LLM-as-judge evals against human-labelled datasets, and (3) leading fine-tuning only when the use case genuinely demands it.

Product / Application / Systems Engineers should own the LLM-as-API integration layer, manage infrastructure (especially critical for distributed multi-agent architectures with supervisor and sub-agents), and build the eval and observability pipelines.

Non-Technical Domain Experts (this is where most enterprise teams have a gap) should own prompt and context engineering and perform human annotation of agent traces. These people have the most proximity to the problem, and their input is foundational, not cosmetic.

If your team has zero domain experts structurally involved, fix this before writing another line of agent code.
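The persona mapping above can be turned into a simple coverage check. The following is a minimal sketch, not part of the Hetzel framework itself: the names `PERSONA_RESPONSIBILITIES`, `TeamMember`, and `missing_personas` are hypothetical, and the responsibility lists are paraphrased from the three role descriptions above.

```python
from dataclasses import dataclass

# Personas and responsibilities paraphrased from the role descriptions above.
# The dictionary keys are illustrative labels, not official framework terms.
PERSONA_RESPONSIBILITIES = {
    "ml_engineer": [
        "LLM risk and statistical literacy",
        "validating LLM-as-judge evals against human labels",
        "fine-tuning only when the use case demands it",
    ],
    "product_engineer": [
        "LLM-as-API integration",
        "infrastructure for multi-agent architectures",
        "eval and observability pipelines",
    ],
    "domain_expert": [
        "prompt and context engineering",
        "human annotation of agent traces",
    ],
}


@dataclass
class TeamMember:
    name: str
    persona: str  # must be one of the keys in PERSONA_RESPONSIBILITIES


def audit_coverage(team):
    """Return a dict mapping each persona to the names of people who fill it."""
    coverage = {persona: [] for persona in PERSONA_RESPONSIBILITIES}
    for member in team:
        if member.persona not in coverage:
            raise ValueError(f"Unknown persona for {member.name}: {member.persona}")
        coverage[member.persona].append(member.name)
    return coverage


def missing_personas(team):
    """List the personas that nobody on the team currently fills."""
    return [p for p, names in audit_coverage(team).items() if not names]


# Example: a team staffed as if agent-building were purely an ML problem.
example_team = [
    TeamMember("Asha", "ml_engineer"),
    TeamMember("Ben", "ml_engineer"),
    TeamMember("Chen", "product_engineer"),
]
gaps = missing_personas(example_team)
```

For the example team, `gaps` contains only `"domain_expert"`, which is the gap this section says to close first.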

How do I redefine success metrics for enterprise agent projects?

Stop leading with precision, recall, and F1. Those are technical metrics suited to a two-box ML pipeline. The Hetzel framework requires you to define functional performance criteria: does the agent actually accomplish its intended purpose for real users?

For a customer service agent, functional performance means the agent correctly resolves the customer's query and does so safely. For an internal knowledge agent, it means the agent retrieves accurate information and presents it in a way the employee can act on.
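As a rough illustration of the customer service case, functional performance can be recorded as a pair of yes/no judgements per agent trace and rolled up into a pass rate. This is a sketch under stated assumptions: the `AnnotatedTrace` fields and the rule that a trace passes only when it is both resolved and safe are one reading of the criteria above, not a definition taken from the framework.

```python
from dataclasses import dataclass


@dataclass
class AnnotatedTrace:
    trace_id: str
    resolved_query: bool   # did the agent correctly resolve the customer's query?
    handled_safely: bool   # did it do so safely?


def passes_functional_criteria(trace):
    """A trace passes only if the query was resolved AND handled safely."""
    return trace.resolved_query and trace.handled_safely


def functional_pass_rate(traces):
    """Fraction of annotated traces that meet both functional criteria."""
    if not traces:
        raise ValueError("Need at least one annotated trace")
    return sum(passes_functional_criteria(t) for t in traces) / len(traces)


# Hypothetical annotations, for illustration only.
sample_traces = [
    AnnotatedTrace("t1", resolved_query=True, handled_safely=True),
    AnnotatedTrace("t2", resolved_query=True, handled_safely=False),
    AnnotatedTrace("t3", resolved_query=False, handled_safely=True),
    AnnotatedTrace("t4", resolved_query=True, handled_safely=True),
]
rate = functional_pass_rate(sample_traces)
```

Treating the two criteria as a conjunction keeps an unsafe but technically correct answer from counting as a success; a team could reasonably choose to report the two rates separately as well.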

Build your eval pipeline around these functional criteria using LLM-as-judge automation validated against human-labelled ground truth from domain experts. Then build an observability pipeline so production behaviour feeds back into your offline eval dataset continuously. The two pillars — evals and observability — are non-negotiable.
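One way to validate an LLM-as-judge against human-labelled ground truth is to measure how often the two agree, corrected for chance. The sketch below uses Cohen's kappa, a standard agreement statistic; the framework as described here does not prescribe a specific statistic, so treat this choice, the function names, and the sample labels as assumptions.

```python
def cohens_kappa(human_labels, judge_labels):
    """Chance-corrected agreement between human and LLM-judge labels."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("Label lists must be non-empty and of equal length")
    n = len(human_labels)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    # Expected agreement if both raters labelled independently at their own base rates.
    labels = set(human_labels) | set(judge_labels)
    expected = sum(
        (human_labels.count(label) / n) * (judge_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def disagreements(trace_ids, human_labels, judge_labels):
    """Trace IDs where the judge and the human annotator disagree."""
    return [
        tid
        for tid, h, j in zip(trace_ids, human_labels, judge_labels)
        if h != j
    ]


# Hypothetical labels: 1 = meets functional criteria, 0 = does not.
trace_ids = ["t1", "t2", "t3", "t4"]
human = [1, 1, 0, 0]
judge = [1, 1, 0, 1]
kappa = cohens_kappa(human, judge)
to_review = disagreements(trace_ids, human, judge)
```

The disagreement list is the useful output in practice: it gives domain experts a short queue of traces to review before the judge is trusted to run unattended.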

What's the first thing I should do as an enterprise ML lead?

Classify your organisation using the Hetzel framework (it is almost certainly a Traditional Enterprise), then audit your current team composition for coverage gaps. Map every person to one of the three personas. Identify who is missing. Then recruit or embed at least one domain expert with real ownership over context engineering and annotation. Redefine your eval criteria to include functional performance. Build observability from day one.
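To make "observability from day one" concrete, here is a minimal sketch of production traces flowing into an offline eval dataset and waiting for domain-expert annotation. The `EvalDataset` class, its method names, and the trace dictionary keys (`trace_id`, `input`, `output`) are all hypothetical; a real setup would read traces from whatever logging or tracing system the team already runs.

```python
from dataclasses import dataclass, field


@dataclass
class EvalDataset:
    # Maps trace_id -> eval case; a human label of None means "awaiting annotation".
    cases: dict = field(default_factory=dict)

    def add_from_production(self, trace):
        """Add a production trace as an eval case; skip traces already present."""
        trace_id = trace["trace_id"]
        if trace_id in self.cases:
            return False
        self.cases[trace_id] = {
            "input": trace["input"],
            "output": trace["output"],
            "human_label": None,
        }
        return True

    def record_label(self, trace_id, label):
        """Store a domain expert's annotation for an existing case."""
        self.cases[trace_id]["human_label"] = label

    def pending_annotation(self):
        """Trace IDs that still need a human label."""
        return [tid for tid, case in self.cases.items() if case["human_label"] is None]


dataset = EvalDataset()
trace = {"trace_id": "t1", "input": "Where is my refund?", "output": "It was issued on Monday."}
first_add = dataset.add_from_production(trace)
second_add = dataset.add_from_production(trace)  # duplicate, so it is ignored
pending_before = dataset.pending_annotation()
dataset.record_label("t1", "resolved")
pending_after = dataset.pending_annotation()
```

Keeping unlabelled cases visible through `pending_annotation` reflects the point that domain experts, not engineers, supply the ground truth.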

The framework's step-by-step workflow gives you a concrete sequence to follow — start with Step 1 (classify organisation type) and work through to Step 7 (pressure-test against proximity to the problem).

// FREQUENTLY ASKED QUESTIONS

Can my existing ML platform team build AI agents without restructuring?

They can build something, but it won't be optimised for the actual challenge. Agentic AI requires context engineering and functional evaluation that ML teams aren't trained for. The Hetzel framework doesn't remove ML engineers — it redirects them to eval validation and guardrails while adding product engineers and domain experts who own the levers that actually change agent behaviour.

How do I justify hiring domain experts for an AI agent team to my VP of Engineering?

Frame it using the Hetzel principle that context engineering — not model training — is the primary lever for changing agent behaviour. Domain experts own this lever because they have the most proximity to the problem. Without them, your team optimises for technical metrics while the agent fails to solve the actual user problem. The business case is avoiding expensive rework after launch.

Should my ML engineers still run A/B tests on AI agents?

The Hetzel framework warns against applying the traditional cross-validation and A/B testing dance to agents because the pipeline is entirely different once the model is already built. Focus instead on functional eval pipelines validated against human annotation, and use observability to catch production failures. A/B testing may still play a role, but it is not the primary quality assurance mechanism for agents.