How Should Data Scientists Contribute to AI Agent Teams?

For data scientists and ML engineers transitioning to agentic AI · Based on the Hetzel Agent Team Composition Framework

// TL;DR

The Hetzel Agent Team Composition Framework redefines the data scientist's role on agentic AI teams. Instead of owning agents generically, data scientists add their highest-leverage value through guardrails, risk assessment, and eval validation — specifically by creating labelled datasets and measuring whether LLM-as-judge evaluators align with human judgment using precision, recall, and F1. This is not a demotion; it is a recognition that the model is already built, agent behavior is controlled via context engineering, and the unique statistical rigor data scientists bring is most impactful in the eval quality layer.

Why Aren't Data Scientists the Default Owners of AI Agent Projects?

The foundation LLM has already been trained and made available through an API by Anthropic, OpenAI, Mistral, or a similar provider. The entire upstream pipeline (data ingestion, model training, cross-validation, deployment) is already complete before the agent team starts work. This is what the Hetzel framework calls the 'Model Is Already Built' principle.

This means the traditional ML workflow that data scientists are trained for — collecting data, engineering features, training models, evaluating on held-out sets — does not apply to most agent development. Agent behavior is changed through context engineering: modifying the prompts, instructions, and context fed to the model. This work is highest-leverage when performed by domain experts with proximity to the problem, not by data scientists working in isolation.

This is not a statement that data scientists are unneeded. It is a statement that their unique contribution is elsewhere.

What Is the Highest-Leverage Role for Data Scientists on Agent Teams?

The Hetzel framework assigns data scientists to the guardrails and eval validation function. This includes:

- Being the statistical conscience: Reminding the team that the LLM is predicting tokens probabilistically, not 'knowing' anything. Preventing over-trust in model outputs.

- Validating LLM-as-judge quality: LLM-as-judge is a powerful eval mechanism, but it is itself just a prompt and a model. Data scientists create labelled datasets where humans annotate agent traces, then measure precision, recall, and F1 of the judge against that ground truth.

- Preventing eval drift: Over time, the LLM judge can diverge from human judgment. Data scientists track this alignment continuously and flag when recalibration is needed.

- Risk assessment: Identifying failure modes, edge cases, and statistical limitations that the rest of the team may overlook.

- Fine-tuning (when required): Fine-tuning an open-source model is rare in agent development, but when genuinely needed, it is the highest-leverage technical contribution for ML engineers.

This role is critically important. Without it, teams ship agents that pass unvalidated evals and fail in production.
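The judge-validation step in the list above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's own tooling: the `judge_alignment` function, the `JudgeMetrics` container, the "pass"/"fail" label strings, and the choice of "fail" as the positive class are all assumptions made for the example, and the eight annotations are invented.

```python
from dataclasses import dataclass


@dataclass
class JudgeMetrics:
    precision: float
    recall: float
    f1: float


def judge_alignment(human_labels, judge_labels, positive="fail"):
    """Score LLM-judge verdicts against human labels for one positive class."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must be the same length")
    pairs = list(zip(human_labels, judge_labels))
    # Human labels are treated as ground truth; the judge is the classifier.
    tp = sum(1 for h, j in pairs if h == positive and j == positive)
    fp = sum(1 for h, j in pairs if h != positive and j == positive)
    fn = sum(1 for h, j in pairs if h == positive and j != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return JudgeMetrics(precision, recall, f1)


# Hypothetical annotations for eight agent traces (illustrative data only).
human = ["fail", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
judge = ["fail", "pass", "pass", "pass", "fail", "fail", "pass", "fail"]
metrics = judge_alignment(human, judge)
print(metrics)
```

Here the judge misses one failure the humans caught and raises one false alarm, so low recall would mean bad traces slip through the eval, while low precision would mean the team chases failures that are not real.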

How Do Traditional ML Metrics Apply to Agent Evaluation?

This is where many data scientists make a category error. Precision, recall, and F1 are designed for classification tasks with well-defined positive and negative classes. Applying them directly to agent behavior — which involves multi-step reasoning, tool use, and open-ended outputs across a full trace — is a mismatch.

The Hetzel framework introduces the concept of Broader Eval Surface Area: agents must be evaluated on functional performance (did the agent accomplish the user's task correctly, end-to-end?), not on narrow classification metrics.

However, precision, recall, and F1 do apply — to the evaluators themselves. When you use LLM-as-judge to assess agent quality, the judge is making a classification decision (good output vs. bad output). Data scientists should measure how well that classification aligns with human-labelled ground truth. This is where traditional ML metrics find their correct application in the agent world.
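As a worked illustration, suppose "bad output" is treated as the positive class (an assumption; the roles can be swapped). Take a hypothetical set of 100 labelled traces in which humans marked 40 as bad and the judge flagged 50 as bad, 30 of which humans also marked bad. That gives TP = 30, FP = 20, FN = 10, and the judge's scores against human ground truth are:

```latex
\begin{align*}
\text{Precision} &= \frac{TP}{TP + FP} = \frac{30}{30 + 20} = 0.60 \\
\text{Recall}    &= \frac{TP}{TP + FN} = \frac{30}{30 + 10} = 0.75 \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2 \cdot 0.60 \cdot 0.75}{0.60 + 0.75} \approx 0.67
\end{align*}
```

In this made-up case the judge catches three quarters of the outputs humans consider bad, but two in five of its "bad" verdicts are false alarms.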

How Should Data Scientists Position Themselves on Agentic AI Teams?

Stop trying to own the entire agent. That framing leads to the Isolation Mistake, in which the ML team takes on systems engineering and prompt engineering work it is not best suited for. Instead, position yourself as the person who ensures the team's confidence in agent quality is grounded in evidence, not vibes.

Concretely:

1. Build the labelled dataset from production agent traces, annotated by domain experts.

2. Measure LLM-as-judge alignment with human judgment using precision, recall, and F1.

3. Report eval drift metrics to the team regularly.

4. Challenge the team when they over-trust model outputs.

5. Own fine-tuning decisions — determine when context engineering is insufficient and fine-tuning is genuinely warranted.

This positions you as indispensable without asking you to do work that domain experts or systems engineers would do better.
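One way to make the drift-reporting step concrete is to recompute judge-versus-human F1 on each new batch of annotated traces and compare it to a baseline. The sketch below assumes a weekly batch cadence, a baseline F1 of 0.90, and a tolerance of 0.05; these values, the function names, and the sample labels are all hypothetical and would need to be chosen by the team.

```python
def f1_score(human_labels, judge_labels, positive="fail"):
    """F1 of judge verdicts against human labels for the positive class."""
    tp = fp = fn = 0
    for h, j in zip(human_labels, judge_labels):
        if j == positive and h == positive:
            tp += 1
        elif j == positive:
            fp += 1
        elif h == positive:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def drift_report(batches, baseline_f1, tolerance=0.05):
    """Flag periods where judge F1 falls more than `tolerance` below baseline.

    `batches` is a list of (period, human_labels, judge_labels) tuples.
    """
    report = []
    for period, human, judge in batches:
        f1 = f1_score(human, judge)
        report.append({
            "period": period,
            "f1": round(f1, 3),
            "recalibrate": (baseline_f1 - f1) > tolerance,
        })
    return report


# Hypothetical weekly batches of human-annotated traces (illustrative only).
human = ["fail", "fail", "pass", "pass", "fail", "pass"]
batches = [
    ("week-1", human, ["fail", "fail", "pass", "pass", "fail", "pass"]),
    ("week-2", human, ["fail", "pass", "pass", "fail", "pass", "pass"]),
]
report = drift_report(batches, baseline_f1=0.90)
for row in report:
    print(row)
```

A fixed threshold is the simplest possible rule. With small batches the F1 estimate is noisy, so in practice the tolerance and batch size should be set with that variance in mind.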

What Should You Do Next?

If you are currently assigned to 'own' an agent project, audit whether you are actually doing guardrails and eval validation or whether you have been inadvertently pulled into systems engineering and prompt engineering. Advocate for bringing in a systems engineer for infrastructure and domain experts for context engineering. Then claim the guardrails role explicitly — build your first labelled dataset, run your first LLM-as-judge validation, and demonstrate the value of grounded eval quality to your team.

// FREQUENTLY ASKED QUESTIONS

Should data scientists learn prompt engineering for AI agents?

Understanding prompt engineering is useful context, but it should not be your primary contribution. The Hetzel framework assigns prompt and context engineering to domain experts and PMs who have the highest proximity to the problem. Your highest-leverage contribution is validating eval quality, creating labelled datasets, and ensuring the team's confidence in agent behavior is grounded in statistical evidence rather than intuition.

Are precision, recall, and F1 still relevant for AI agents?

Yes, but they apply to evaluators, not to the agent itself. When you use LLM-as-judge to assess agent quality, the judge is making a classification decision (good output vs. bad output) that can be measured with precision, recall, and F1 against human-labelled ground truth. Applying these metrics directly to multi-step agent behavior across a full trace is a category error that the Hetzel framework explicitly warns against.

How do I avoid being sidelined as a data scientist on an agent team?

Own the guardrails and eval validation function explicitly. Build the labelled dataset, validate LLM-as-judge alignment with human judgment, track eval drift, and serve as the team's statistical conscience. This role is indispensable — without it, the team ships agents based on unvalidated evals. Position yourself as the person who ensures confidence in agent quality is evidence-based, and you become the most trusted voice on the team.