How Should Product Managers Shape AI Agent Team Structure?
For product managers leading AI agent initiatives · Based on the Hetzel Agent Team Composition Framework
// TL;DR
As a product manager leading an AI agent initiative, you are not a spectator — you are one of the highest-leverage contributors on the team. The Hetzel Agent Team Composition Framework positions PMs and domain experts as owners of context engineering (the prompts and instructions that drive agent behavior) and as the primary definers of functional success criteria. Use this framework to advocate for cross-functional team structure, prevent the isolation mistake of leaving agents to data scientists alone, and ensure your domain knowledge directly shapes agent behavior rather than being filtered through technical intermediaries.
Why Are Product Managers Critical on AI Agent Teams?
The Hetzel framework's core principle of proximity to the problem explains why. In agentic AI, the foundational model is already trained — behavior is changed not through retraining but through context engineering: modifying the prompts, instructions, and context fed to the model. The person who best understands the problem, the user, and what 'good' looks like holds disproportionate value.
That person is usually you — the product manager or domain expert, not the ML engineer.
Your proximity to users, your understanding of edge cases, and your ability to define functional success criteria make you essential for context engineering, human annotation, and evaluation design. Without your input, engineers risk building technically sound agents that solve the wrong problem or miss critical real-world nuances.
How Should I Advocate for the Right Team Structure?
Use the Hetzel framework's three-role model to audit your current team and make the case for restructuring:
1. Data Scientists / ML Engineers should own guardrails, LLM-as-judge validation, and, only when needed, fine-tuning.
2. Product / Systems Engineers should own API integration, multi-agent orchestration, infrastructure, and the eval-observability pipeline.
3. Domain Experts / PMs — that is you — should own context engineering, human annotation, and the definition of what 'good agent behavior' looks like.
If your team is entirely data scientists, flag the isolation mistake. Frame it to leadership as a capability gap, not a personnel criticism: agent development requires systems engineering and domain expertise that the ML team was never staffed to provide.
Bring this framework to your next team planning session with concrete role assignments for each team member. Map each person to the responsibilities where they create the most value.
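One way to prepare for that session is to write the audit down as data. The sketch below is illustrative only: the role keys and responsibilities mirror the three-role model above, while the roster, the names, and the helper functions `audit_team` and `missing_roles` are hypothetical.

```python
# Responsibilities follow the three-role model described above.
# The dictionary keys are hypothetical identifiers chosen for this sketch.
REQUIRED_ROLES = {
    "data_scientist_ml_engineer": [
        "guardrails",
        "LLM-as-judge validation",
        "fine-tuning (only when needed)",
    ],
    "product_systems_engineer": [
        "API integration",
        "multi-agent orchestration",
        "infrastructure",
        "eval-observability pipeline",
    ],
    "domain_expert_pm": [
        "context engineering",
        "human annotation",
        "definition of good agent behavior",
    ],
}


def audit_team(roster: dict[str, str]) -> dict[str, list[str]]:
    """Return the people assigned to each required role (empty list = gap)."""
    coverage: dict[str, list[str]] = {role: [] for role in REQUIRED_ROLES}
    for person, role in roster.items():
        if role not in coverage:
            raise ValueError(f"Unknown role for {person}: {role}")
        coverage[role].append(person)
    return coverage


def missing_roles(roster: dict[str, str]) -> list[str]:
    """List required roles that nobody on the roster currently fills."""
    return [role for role, people in audit_team(roster).items() if not people]


# Hypothetical roster: a team made up entirely of data scientists.
team = {
    "Ana": "data_scientist_ml_engineer",
    "Ben": "data_scientist_ml_engineer",
}

gaps = missing_roles(team)
for role in gaps:
    print(f"Gap: {role} -> unowned: {', '.join(REQUIRED_ROLES[role])}")
```

A roster like this one surfaces the isolation mistake directly: two of the three roles have no owner, which is the capability gap to raise with leadership.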
How Do I Take Ownership of Context Engineering?
Demand direct access to the prompts and instructions that drive your agent. Do not accept a workflow where you describe requirements to engineers who then translate them into prompts — you lose critical nuance in that translation.
Write and edit the agent's system prompts, task instructions, and contextual information yourself, or at minimum co-own them with an engineer who handles templating and integration. You know that a customer asking about 'my balance' might mean a checking account, a savings account, or a credit card depending on context. You know that 'urgent' means different things to different user segments. This knowledge must go directly into the agent's instructions.
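A minimal sketch of what co-ownership could look like in practice, assuming the PM maintains the domain rules as plain text and an engineer maintains the template that assembles them. The rule wording, the template, the segment name, and `build_system_prompt` are all hypothetical; nothing here is prescribed by the framework.

```python
from string import Template

# Hypothetical domain rules that a PM edits directly, in plain language.
DOMAIN_RULES = [
    "If the user asks about 'my balance' without naming an account, "
    "ask whether they mean checking, savings, or credit card.",
    "Interpret 'urgent' according to the user's segment; "
    "do not assume one meaning for all users.",
]

# Hypothetical template owned by an engineer; $segment and $rules are placeholders.
SYSTEM_PROMPT = Template(
    "You are a banking support agent.\n"
    "User segment: $segment\n"
    "Domain rules:\n"
    "$rules"
)


def build_system_prompt(segment: str, rules: list[str]) -> str:
    """Number the PM-owned rules and insert them into the engineer-owned template."""
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return SYSTEM_PROMPT.substitute(segment=segment, rules=numbered)


prompt = build_system_prompt("small-business", DOMAIN_RULES)
print(prompt)
```

Keeping the rules in a separate, readable list means a domain expert can change agent behavior without touching integration code, which is the point of avoiding the requirements-to-prompt translation step.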
Also own the human annotation workflow: regularly review agent traces and label whether the agent performed well, with explanations of why. This creates the grounded dataset that validates the entire evaluation pipeline.
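As a rough illustration of the annotation workflow, the sketch below records one label per reviewed trace and refuses labels that lack an explanation. The field names, the `TraceAnnotation` class, and the `summarize` helper are assumptions made for this example, not part of the framework.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceAnnotation:
    """One human judgment about one agent trace (hypothetical schema)."""

    trace_id: str
    performed_well: bool
    explanation: str
    reviewer: str

    def __post_init__(self) -> None:
        # A label without a reason is hard to learn from, so reject it.
        if not self.explanation.strip():
            raise ValueError(f"Annotation for {self.trace_id} needs an explanation")


def summarize(annotations: list[TraceAnnotation]) -> dict[str, float]:
    """Count good and bad traces and compute the share labeled as good."""
    counts = Counter(a.performed_well for a in annotations)
    total = len(annotations)
    return {
        "total": total,
        "good": counts[True],
        "needs_work": counts[False],
        "pass_rate": counts[True] / total if total else 0.0,
    }


# Hypothetical labels from one review session.
labels = [
    TraceAnnotation("t-001", True, "Asked which account before quoting a balance.", "pm"),
    TraceAnnotation("t-002", False, "Assumed checking account; user meant credit card.", "pm"),
    TraceAnnotation("t-003", True, "Escalated an urgent request correctly.", "pm"),
]

summary = summarize(labels)
print(summary)
```

The explanations matter as much as the labels: they are what lets the team turn a failed trace into a prompt change or a new eval case.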
How Do I Define Functional Success Criteria for Agents?
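Traditional ML teams default to precision, recall, and F1 as success metrics. The Hetzel framework identifies this as a category error when applied to agents. These metrics belong to evaluators: they describe how well an evaluation mechanism, such as an LLM-as-judge, agrees with human labels. They do not describe whether the agent's end-to-end behavior succeeded for the user.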
As PM, define functional criteria:
- What must the agent do correctly, end-to-end, for a real user to consider the interaction successful?
- What are the failure modes that would cause user harm, trust loss, or business damage?
- What edge cases occur in the real world that the agent must handle gracefully?
Document these criteria and ensure both your evals and your observability dashboard measure against them. Let data scientists validate the eval mechanism's statistical quality against your labelled dataset, but you define what 'correct' means.
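The division of labor in the paragraph above can be sketched as follows, assuming the PM's labelled dataset marks each trace as correct or incorrect and an automated judge produces its own verdict for the same traces. Precision, recall, and F1 are then computed for the judge, not for the agent. The function name `judge_agreement` and the sample labels are hypothetical.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Score an automated judge against human labels.

    True means "the agent behaved correctly". The human labels are treated
    as ground truth, so the metrics describe the judge's quality.
    """
    if len(human) != len(judge):
        raise ValueError("human and judge labels must cover the same traces")

    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)        # both say correct
    fp = sum(1 for h, j in pairs if not h and j)    # judge too lenient
    fn = sum(1 for h, j in pairs if h and not j)    # judge too strict

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical verdicts for six traces.
human_labels = [True, True, False, True, False, False]  # PM-defined "correct"
judge_labels = [True, False, False, True, True, False]  # automated judge output

scores = judge_agreement(human_labels, judge_labels)
print(scores)
```

If the judge scores poorly against the human labels, the fix belongs to the evaluation mechanism. The definition of 'correct' stays with the PM's labels.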
What Should I Do Next?
Bring the Hetzel framework to your next agent team meeting. Audit your team composition against the three required roles. Claim your ownership of context engineering and functional success criteria. Set up a human annotation workflow where you or your domain experts review agent traces weekly. Finally, advocate for the evals-plus-observability feedback loop: without it, your agent can degrade post-launch and you will be left diagnosing failures without data.
// FREQUENTLY ASKED QUESTIONS
Should product managers write prompts for AI agents?
Yes — product managers and domain experts should write or co-own the prompts and context that drive agent behavior. The Hetzel framework's proximity-to-the-problem principle shows that the person closest to the user and the problem has the highest leverage for context engineering. Do not gate prompt creation behind technical staff who lack your domain understanding.
How do product managers define success metrics for AI agents?
Define functional success criteria based on what the agent must do correctly for real users end-to-end, not traditional ML metrics like precision and recall. Identify critical failure modes, user-facing quality standards, and real-world edge cases. These criteria become the benchmark for evals and observability. Let data scientists validate the evaluation mechanism, but as PM you define what 'correct' means.
What is the PM's role in AI agent evaluation and observability?
PMs own two critical inputs to the eval pipeline: defining functional success criteria and leading human annotation. Review agent traces regularly, label them as correct or incorrect with explanations, and ensure this labelled data feeds back into the eval dataset. This grounds your automated evaluations in real-world quality standards. Without PM involvement, evals drift toward technical metrics that miss user-facing quality issues.