How Should Data Scientists Adapt Their Role for AI Agents?
For data scientists and ML engineers transitioning to agentic AI · Based on the Hetzel Agent Team Composition Framework
// TL;DR
If you're a data scientist or ML engineer assigned to build an AI agent, the Hetzel Agent Team Composition Framework redefines your highest-value contribution. The model is already built — your job is not training or feature engineering. Instead, you should own eval validation (ensuring LLM-as-judge assessments align with human labels), guardrails and statistical risk analysis, and fine-tuning only when genuinely required. The framework also tells you who you need on your team: product engineers for infrastructure and domain experts for context engineering. Understanding this shift prevents you from optimising for the wrong metrics.
Why can't I just apply my ML skills to building agents?
You can, but the Hetzel framework reveals that your ML skills are most valuable in a different place than you expect. Traditional ML development centres on training data, feature engineering, model selection, cross-validation, and deployment. Agent development looks fundamentally different because the model has already been built by Anthropic, OpenAI, Mistral, or another provider.
Your traditional pipeline (collect data, engineer features, train a model, validate with precision/recall/F1, deploy) doesn't map cleanly to agents. The model provider has already done the training and testing. The team's job is to implement, evaluate, and contextualise a pre-built model, which changes which skills matter most.
The Hetzel framework doesn't sideline you. It redirects you to your highest-value contribution.
What should my specific role be on an agentic AI team?
The Hetzel framework assigns data scientists and ML engineers three agent-specific responsibilities:
1. The 'adult in the room' on LLM risk and statistical literacy. You understand distributions, confidence intervals, failure modes, and edge cases better than your product engineering or domain expert teammates. Apply that literacy to agent behaviour — not to model training.
2. Eval validation. When your team uses LLM-as-judge evaluations (where a language model scores agent outputs), you validate those judges against human-labelled ground truth. Calculate precision, recall, and F1 for the judge-human alignment — not for the agent itself. If the judge drifts from human agreement, you catch it. This is where your metrics expertise is irreplaceable.
3. Fine-tuning leadership. If — and only if — the use case genuinely requires fine-tuning an open-source model, you lead that effort. But the Hetzel framework is explicit: most agent behaviour is changed through context engineering (adjusting prompts and inputs), not retraining. Don't default to fine-tuning because it's familiar.
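The eval validation described in point 2 can be sketched in a few lines. This is a minimal illustration, not part of the Hetzel framework itself: it assumes binary labels (True means the agent output was judged acceptable), treats the human label as ground truth, and uses a hypothetical function name `judge_alignment` and made-up example labels.

```python
from dataclasses import dataclass


@dataclass
class JudgeAlignment:
    precision: float
    recall: float
    f1: float
    agreement: float  # share of items where judge and human gave the same label


def judge_alignment(human: list[bool], judge: list[bool]) -> JudgeAlignment:
    """Score an LLM judge against human labels (True = output acceptable)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)        # both say acceptable
    fp = sum((not h) and j for h, j in pairs)  # judge passes what humans reject
    fn = sum(h and (not j) for h, j in pairs)  # judge rejects what humans pass
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return JudgeAlignment(precision, recall, f1, (tp + tn) / len(pairs))


# Illustrative labels only: eight agent outputs scored by humans and by the judge.
human_labels = [True, True, False, True, False, False, True, True]
judge_labels = [True, True, True, True, False, False, False, True]
print(judge_alignment(human_labels, judge_labels))
```

Note that these scores describe the judge, not the agent. Recomputing them on a fresh human-labelled sample at intervals is one way to notice the judge drifting away from human agreement.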
Who else needs to be on my team and why?
The Hetzel framework requires two other personas alongside you:
- Product / Application / Systems Engineers: They treat LLMs as APIs — send a payload, receive a response, make it useful. They own the infrastructure, especially for distributed multi-agent architectures, and build the observability pipelines that connect production behaviour back to your evals. You need them because agent deployment is a systems engineering problem, not a data science problem.
- Non-Technical Domain Experts / SMEs: These are the people closest to the problem the agent solves. They own context engineering (the primary lever for changing agent behaviour) and human annotation (reviewing agent traces and labelling quality with reasoning). You need them because without domain grounding, your evals will measure technical correctness without capturing functional performance — whether the agent actually solves the user's problem.
If you're on a team with no domain expert, the Hetzel framework says to fix this before building further.
How do I avoid the trap of optimising for the wrong metrics?
Precision, recall, and F1 are your comfort zone, and for traditional ML they're exactly right. But agents operate across a far broader surface area than a classification model. Locking onto these metrics as the primary eval signal is what the Hetzel framework calls a trap.
Instead, help your team define functional performance criteria: Does the agent actually resolve the customer's query? Does it produce a correct and complete analysis? Does it handle edge cases safely? Use your statistical skills to measure these broader criteria rigorously, not to reduce agent quality to a single F1 score.
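One way to measure a functional criterion rigorously is to report it with an interval rather than as a bare percentage. The sketch below assumes a single pass/fail criterion ("did the agent resolve the query?") scored on a sample of reviewed traces, and computes a Wilson score interval for the resolution rate. The function name `wilson_interval` and the example counts are illustrative assumptions, not something the framework prescribes.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Illustrative numbers: 42 of 50 reviewed traces resolved the customer's query.
low, high = wilson_interval(42, 50)
print(f"resolution rate 84%, 95% interval roughly {low:.2f} to {high:.2f}")
```

With only 50 traces the interval is wide, which is useful to show the team before anyone reads too much into a change of a few percentage points between prompt versions.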
Your metrics expertise is essential — but apply it to the right questions.
What should I do this week?
Audit your current agent project against the Hetzel framework's three personas. If you're the only data scientist on a team of engineers with no domain expert, flag it. If your evals rely solely on LLM-as-judge without human validation, build that validation loop. If you've been defaulting to fine-tuning, investigate whether context engineering could achieve the same result. Your role on this team is crucial — but it's different from what you're used to.
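A first step towards that validation loop is choosing which traces humans should label. The sketch below assumes each trace is a dict with hypothetical `id` and `judge_verdict` keys, and samples a fixed number of traces per judge verdict so that rarer verdicts (often the failures) are not swamped by the common ones. The field names, the function name `sample_for_human_review`, and the sample sizes are all illustrative.

```python
import random
from collections import defaultdict


def sample_for_human_review(traces: list[dict], per_verdict: int, seed: int = 0) -> list[dict]:
    """Pick up to `per_verdict` traces for each judge verdict, reproducibly."""
    rng = random.Random(seed)  # fixed seed so the review batch can be regenerated
    by_verdict: dict[str, list[dict]] = defaultdict(list)
    for trace in traces:
        by_verdict[trace["judge_verdict"]].append(trace)
    sample: list[dict] = []
    for verdict in sorted(by_verdict):  # sorted for a deterministic order
        group = by_verdict[verdict]
        sample.extend(rng.sample(group, min(per_verdict, len(group))))
    return sample


# Illustrative data: 17 traces the judge passed and 3 it failed.
traces = [{"id": i, "judge_verdict": "pass"} for i in range(17)]
traces += [{"id": 100 + i, "judge_verdict": "fail"} for i in range(3)]
batch = sample_for_human_review(traces, per_verdict=5)
print(len(batch), "traces queued for human labelling")
```

The human labels collected on this batch become the ground truth against which the judge is scored. Because the sample is stratified rather than uniform, any overall agreement rate computed from it should be reweighted, or reported per verdict.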
// FREQUENTLY ASKED QUESTIONS
Am I being replaced by prompt engineers as a data scientist on agent teams?
No. The Hetzel framework explicitly requires data scientists as the 'adult in the room' on LLM risk, statistical validation, and eval quality. Your role shifts from model training to eval validation and guardrail ownership. Prompt and context engineering is primarily owned by domain experts — the people with proximity to the problem — but your statistical rigour ensures the overall system is trustworthy. You're not replaced; you're redirected to higher-value work.
Should I still use precision, recall, and F1 when evaluating agents?
Yes, but not as the primary eval signal for the agent itself. Use these metrics specifically to validate LLM-as-judge alignment with human labels — measuring whether your automated evaluator agrees with human assessments. For the agent's overall quality, define functional performance criteria with your domain experts: does the agent accomplish its purpose for real users? Your metrics skills are essential but must target the right questions.
How do I know if my agent use case genuinely requires fine-tuning?
Exhaust context engineering first: adjust prompts, context, and inputs to change agent behaviour. Fine-tuning may be justified if the use case requires one of three things: specialised reasoning patterns that general LLMs cannot achieve through prompting, domain-specific output formats, or latency-sensitive deployment on smaller models. The Hetzel framework treats fine-tuning as rare. If you're defaulting to it because it's familiar from traditional ML, reconsider.