How Should Healthcare AI Teams Monitor Agent Quality?

For Healthcare AI product managers · Based on Hetzel Agent Observability Differentiation Framework

// TL;DR

Healthcare AI product managers need observability that goes far beyond uptime dashboards. The Hetzel Agent Observability Differentiation Framework shows how to layer functional observability (clinical accuracy, guideline adherence, grounding in patient records) on top of traditional technical monitoring. It requires clinicians to participate as annotators, reviewing agent traces and writing justifications that seed automated scoring functions. Applied this way, the framework lets you monitor triage assistants, clinical decision support agents, and patient-facing chatbots for both technical reliability and clinical quality.

Why Is Datadog Not Enough for Our Clinical AI Agent?

Your DevOps team monitors the clinical AI agent with Datadog and reports 99.9% uptime with sub-200ms latency. Everything looks green. But a registered nurse reviewing actual patient interactions discovers the agent recommended a contraindicated medication pathway for patients with a specific comorbidity. Datadog saw nothing wrong.

This is the Scope Difference Principle from the Hetzel framework in action. Traditional observability answers one question: is the system up and performing technically? Agent observability in healthcare must answer much harder questions: Was the agent's response grounded in the patient's retrieved medical records? Did it follow approved clinical pathways? Did it use appropriate clinical language? Did it correctly interpret lab results before making a recommendation?

These are functional observability questions that require purpose-built tooling and — critically — clinical domain expertise to evaluate.

Who Should Be Reviewing Agent Traces in a Healthcare Setting?

The Hetzel framework's Dual Persona Requirement is non-negotiable in healthcare. Your engineering team monitors latency, error rates, and token counts. But only clinicians — registered nurses, physicians, pharmacists — can evaluate whether the agent's clinical reasoning was appropriate, whether its recommendations were safe, and whether its language met clinical communication standards.

The framework requires a human annotation workflow where clinicians review production traces, assign quality grades, and write detailed justifications for their grades. The justifications are the critical output, not just the grades. A grade of "poor" tells you nothing actionable. A justification that says "agent recommended ACE inhibitor for a patient with documented angioedema history, which is contraindicated" tells you exactly what scoring function to build.

These justifications become the raw material for automated clinical accuracy scoring functions that can run at scale across thousands of daily interactions.
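As a minimal sketch of how one justification could be turned into a scoring function, the example below encodes the ACE inhibitor and angioedema case as a rule. The `Annotation` record, the `CONTRAINDICATION_RULES` table, and the `contraindication_score` function are hypothetical names introduced for illustration, not part of the Hetzel framework, and plain substring matching stands in for the coded medication and problem-list data a real system would likely use.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One clinician review of one production trace (hypothetical shape)."""
    trace_id: str
    grade: str          # e.g. "good", "acceptable", "poor"
    justification: str  # free-text clinical reasoning; the critical output


# Hypothetical rule table distilled from clinician justifications.
# Each entry: (recommended drug class, documented history that contraindicates it).
CONTRAINDICATION_RULES = [
    ("ace inhibitor", "angioedema"),
]


def contraindication_score(recommendation: str, patient_history: list[str]) -> float:
    """Return 0.0 if the recommendation matches a known contraindication, else 1.0."""
    rec = recommendation.lower()
    history = " ".join(patient_history).lower()
    for drug_class, condition in CONTRAINDICATION_RULES:
        if drug_class in rec and condition in history:
            return 0.0
    return 1.0


note = Annotation(
    trace_id="trace-001",
    grade="poor",
    justification="Agent recommended ACE inhibitor for a patient with "
                  "documented angioedema history, which is contraindicated.",
)
print(contraindication_score("Start an ACE inhibitor", ["Angioedema (documented)"]))
```

Each new justification that names a concrete failure can add a row to the rule table or motivate a new scorer, which is how annotation effort compounds over time.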

How Do We Catch Clinical Failures We Didn't Anticipate?

Some failure modes are predictable: you can build automated checks for groundedness in patient records, adherence to known clinical guidelines, and awareness of drug interactions. The Hetzel framework calls these known unknowns.

But in healthcare, the most dangerous failures are often unknown unknowns: patterns you never thought to test for. The framework addresses this through LLM-based embedding and clustering over production traces. This can surface unexpected patterns, for example that patients are asking the triage agent about off-label drug uses that it handles inconsistently, or that the agent gives different recommendations for the same symptoms depending on how the patient phrases the complaint.

These emergent failure modes only become visible when you analyze production trace data at scale using clustering and topic modeling — and then validate the findings with clinical domain experts.
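The embed-then-cluster idea can be sketched in a few lines. This is an illustration under loud assumptions: `embed` uses plain word counts as a stand-in for an LLM embedding model, and `cluster` is a simple greedy threshold method rather than whatever clustering algorithm a production system would choose. All function names and the `threshold` default are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: word counts. Swap in a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts: list[str], threshold: float = 0.5) -> list[list[int]]:
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    vectors = [embed(t) for t in texts]
    clusters: list[list[int]] = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster matched, so this trace seeds a new one.
            clusters.append([i])
    return clusters


questions = [
    "off-label use of metformin",
    "off-label use of gabapentin",
    "chest pain when climbing stairs",
]
for members in cluster(questions):
    print([questions[i] for i in members])
```

The output of a step like this is a set of candidate topics, not a verdict. Clusters that are new or growing are the ones worth routing to clinicians for review.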

How Do We Close the Loop Between Finding Problems and Fixing Them?

The Hetzel framework's iteration loop is especially critical in healthcare, where agent quality directly impacts patient safety. When a clinician flags a problematic trace during annotation, or when clustering reveals an unexpected failure pattern, the workflow should immediately allow the team to:

1. Add the problematic trace to an offline evaluation dataset

2. Run batch evals with modified agent configurations

3. Validate that the fix resolves the issue without introducing regressions

The framework emphasizes that observability and evals are the same system: the same trace infrastructure, the same scoring functions. Don't build separate pipelines. In healthcare, the speed of this iteration loop determines how quickly you can remediate patient safety concerns.
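The three steps of the loop can be sketched as follows, reusing one scorer for both the baseline and the candidate configuration. The trace shape (a dict with an `"id"` key), the `Agent` and `Scorer` type aliases, and all function names are assumptions made for illustration.

```python
from typing import Callable

Trace = dict  # hypothetical shape: {"id": ..., "input": ..., "history": [...]}
Agent = Callable[[Trace], str]
Scorer = Callable[[Trace, str], float]


def add_to_dataset(dataset: list[Trace], trace: Trace) -> None:
    """Step 1: add a flagged trace to the offline dataset, skipping duplicates."""
    if all(existing["id"] != trace["id"] for existing in dataset):
        dataset.append(trace)


def run_batch_eval(dataset: list[Trace], agent: Agent, scorer: Scorer) -> dict[str, float]:
    """Step 2: score every trace in the dataset under one agent configuration."""
    return {trace["id"]: scorer(trace, agent(trace)) for trace in dataset}


def find_regressions(baseline: dict[str, float], candidate: dict[str, float]) -> list[str]:
    """Step 3: ids of traces whose score dropped under the candidate configuration."""
    return [tid for tid, score in candidate.items() if score < baseline.get(tid, 0.0)]
```

A fix is ready to ship in this sketch when the flagged trace scores higher under the candidate configuration and `find_regressions` returns an empty list for the rest of the dataset.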

Remember: your existing Datadog or Grafana setup remains valuable for technical monitoring. The Hetzel framework explicitly recommends keeping traditional tools for the technical layer and layering agent-specific clinical quality observability on top.

Next step: Identify three to five clinicians willing to participate in weekly trace annotation sessions. Start with the framework's Step 6 (design the human annotation workflow) and ensure every annotation includes a written clinical justification, not just a pass/fail grade.

// FREQUENTLY ASKED QUESTIONS

How do clinicians participate in agent observability without it being too technical?

Build a separate trace review interface designed for clinicians — showing full conversation traces in natural language, not engineering dashboards. Clinicians browse patient interactions, assign quality grades on clinical accuracy and safety, and write justifications explaining their reasoning. The interface should support natural language search so clinicians can find traces by clinical topic. No SQL or dashboard expertise required.

What automated scoring functions should we build first for a clinical AI agent?

Start with groundedness checks — verifying the agent's response was grounded in the patient's retrieved medical records and not hallucinated. Then add clinical guideline adherence checks and drug interaction awareness scoring. These are known unknowns you can anticipate. Build them using justifications from clinician annotations as your specification. Layer in clustering for unknown unknowns as your second priority.
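A first groundedness check can be very simple. The sketch below scores a response by the fraction of its sentences whose words mostly appear in the retrieved records. This lexical-overlap heuristic is a stand-in chosen for brevity, not the framework's method; a production scorer would more likely use an entailment model or an LLM judge. The function name and the `min_overlap` default are hypothetical.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(response: str, retrieved_records: list[str], min_overlap: float = 0.6) -> float:
    """Fraction of response sentences mostly covered by the retrieved records."""
    source = _tokens(" ".join(retrieved_records))
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences:
        return 1.0  # an empty response makes no unsupported claims
    supported = 0
    for sentence in sentences:
        words = _tokens(sentence)
        if words and len(words & source) / len(words) >= min_overlap:
            supported += 1
    return supported / len(sentences)


records = ["Potassium 5.9 mmol/L on 2024-03-01", "History of chronic kidney disease"]
reply = "Potassium is 5.9 mmol/L. The patient has a history of liver failure."
print(groundedness(reply, records))
```

In this example the first sentence is covered by the lab record and the second is not, so half of the response counts as grounded. Low-scoring traces are good candidates for clinician review.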

How do we handle the large trace sizes from clinical agents that include full patient records?

Clinical agent traces can be extremely large because they include retrieved patient records, lab results, and conversation context. This is the 'agent traces are nasty' problem. You need trace infrastructure with write-ahead log ingestion for real-time visibility, full-text indexing for searching clinical terms across traces, and tiered storage. Standard log management tools are generally not designed for payloads of this size and structure, so expect them to struggle.
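To make the full-text indexing requirement concrete, here is a toy in-memory inverted index that maps each term to the traces containing it. It is only a sketch of the data structure: the `TraceIndex` class is a hypothetical name, and it ignores persistence, tiered storage, and write-ahead logging, which real trace infrastructure would need.

```python
import re
from collections import defaultdict


class TraceIndex:
    """Toy inverted index: term -> ids of traces containing that term."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, trace_id: str, text: str) -> None:
        """Index every alphanumeric term in the trace text."""
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self._postings[term].add(trace_id)

    def search(self, query: str) -> set[str]:
        """Return ids of traces containing every query term (AND semantics)."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


index = TraceIndex()
index.add("t1", "Patient on warfarin, INR 3.2")
index.add("t2", "Warfarin held before surgery")
print(sorted(index.search("warfarin")))
print(sorted(index.search("warfarin INR")))
```

The same lookup is what lets a clinician find every trace mentioning a given drug or lab value without writing a query language by hand.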