Hetzel Agent Observability Differentiation Framework
Accurately diagnose whether a given AI agent system requires traditional observability tooling, agent-specific observability, or both — and design the right observability stack accordingly.
// TL;DR
The Hetzel Agent Observability Differentiation Framework is a diagnostic framework for determining whether an AI agent system needs traditional observability tooling (Datadog, Grafana), agent-specific observability, or both. Use it whenever you're designing, auditing, or advising on the observability strategy for LLM-based agent systems. It separates technical metrics like latency and uptime from functional metrics like groundedness, tool-use correctness, and brand alignment — and identifies which stakeholders (engineers vs. domain experts) need access to each layer. Apply it when existing monitoring tools feel insufficient for agent workloads.
// When should you apply the Hetzel Agent Observability Differentiation Framework?
Use this skill whenever you are designing, auditing, or advising on the observability strategy for an AI agent system. Also apply it when stakeholders ask why existing tools like Grafana or Datadog are insufficient for agent workloads.
// What information do you need before applying the Hetzel framework?
- system_type (required)
Is the system a traditional deterministic application, an LLM-based agent, or a hybrid?
- existing_tooling (required)
What observability tools are already in use (e.g., Datadog, Grafana, ClickHouse)?
- stakeholder_personas (required)
Who needs to consume the observability data: systems engineers only, or also domain experts like clinicians, lawyers, wealth advisors?
- agent_use_case (required)
What is the agent doing? What domain does it operate in? What does 'quality' mean for this agent?
- scale_expectations
Expected volume of agent interactions in production, as a rough order of magnitude.
// What core principles drive the Hetzel Agent Observability Differentiation Framework?
Scope Difference Principle
Traditional observability is entirely about uptime and technical performance — is the application up, and is it delivering the expected user experience from a technical lens. Agent observability has a fundamentally different and broader scope: it must also measure qualitative, semantic, and reasoning-level properties of agent behavior.
Non-Determinism Problem
Traditional applications have deterministic code paths and known control flow. Agents are non-deterministic: the LLM chooses the reasoning path at run time, and its outputs vary widely even across identical or similar inputs. This means you cannot rely on a constrained set of known metrics alone; you must also measure why an agent took one path rather than another.
Agent Traces Are Nasty
Agent traces are highly semi-structured, contain massive volumes of unstructured text, can exceed a gigabyte per trace with individual spans reaching 20 megabytes, and must still be delivered in true real time. This is a completely new systems problem that traditional observability infrastructure was not designed to handle.
Functional vs. Technical Observability
Agent observability has two layers: functional observability (was the agent's output grounded, aligned to brand standard, using the right tools, producing quality responses?) and technical observability (latency, time to first token, token count, duration). Technical observability comes 'on the house' when you properly trace an agent; functional observability requires purpose-built tooling.
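The split between the two layers can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `Span` record with timing and token fields; the field names and the `tool_use_score` function are invented for the example and do not reflect any particular tracing library's schema. The technical metrics fall out of the span fields directly, while the functional score needs extra knowledge (here, which tools were expected).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One step in a trace (hypothetical shape, for illustration only)."""
    name: str
    start_s: float                         # wall-clock start, in seconds
    end_s: float                           # wall-clock end, in seconds
    first_token_s: Optional[float] = None  # set only for model calls
    tokens: int = 0


def technical_metrics(spans: list[Span]) -> dict:
    """Technical layer: derived purely from span timing and token fields."""
    start = min(s.start_s for s in spans)
    end = max(s.end_s for s in spans)
    first_tokens = [s.first_token_s for s in spans if s.first_token_s is not None]
    return {
        "duration_s": end - start,
        "time_to_first_token_s": (min(first_tokens) - start) if first_tokens else None,
        "total_tokens": sum(s.tokens for s in spans),
    }


def tool_use_score(spans: list[Span], expected_tools: set) -> float:
    """Functional layer (one example): fraction of expected tools actually called."""
    if not expected_tools:
        return 1.0
    called = {s.name for s in spans}
    return len(expected_tools & called) / len(expected_tools)


trace = [
    Span("llm_call", 0.0, 1.5, first_token_s=0.5, tokens=300),
    Span("search_records", 1.5, 2.0),
    Span("llm_call", 2.0, 3.0, first_token_s=2.25, tokens=200),
]
print(technical_metrics(trace))
print(tool_use_score(trace, {"search_records", "lookup_policy"}))
```

Groundedness and brand alignment would need their own scorers, typically with access to the retrieved context and the system prompt; they are omitted here to keep the sketch short.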
Dual Persona Requirement
Traditional observability is consumed exclusively by systems or product engineers. Agent observability, done well, must include both technical and non-technical people — domain experts like clinicians, lawyers, or wealth advisors who are closest to the users or the problem space and can evaluate qualitative agent quality in natural language.
Observability and Evals Are the Same Problem
Observability and evals are solved with the same underlying system. The only difference is that with evals, you know the inputs ahead of time and run them in batch; with observability, you do not know the inputs ahead of time and process them in real time.
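One way to picture this: the same scoring function serves both modes, and only the driver around it changes. The sketch below is an assumption-laden toy; `mentions_source`, `run_eval`, and `observe` are hypothetical names, and the scorer is a deliberately trivial stand-in for a real groundedness check.

```python
from typing import Callable, Iterable, Iterator

Scorer = Callable[[str, str], float]  # (input, output) -> score in [0, 1]


def mentions_source(question: str, answer: str) -> float:
    """Toy scorer: 1.0 if the answer carries a source marker, else 0.0."""
    return 1.0 if "[source:" in answer else 0.0


def run_eval(dataset: list, scorer: Scorer) -> float:
    """Evals: inputs are known ahead of time, so score them in one batch."""
    if not dataset:
        return 0.0
    scores = [scorer(q, a) for q, a in dataset]
    return sum(scores) / len(scores)


def observe(stream: Iterable, scorer: Scorer) -> Iterator[float]:
    """Observability: inputs arrive one at a time, so score each as it lands."""
    for q, a in stream:
        yield scorer(q, a)


pairs = [
    ("What is my deductible?", "It is $500 [source: policy.pdf]"),
    ("Can I cancel?", "Probably."),
]
print(run_eval(pairs, mentions_source))             # batch: one aggregate score
print(list(observe(iter(pairs), mentions_source)))  # streaming: one score per item
```

Because `run_eval` and `observe` accept the same `Scorer`, a check written for offline experiments can be pointed at production traffic without modification.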
Known Unknowns vs. Unknown Unknowns
Some agent quality issues are known unknowns — failure modes you can anticipate and build automated scores for. Others are unknown unknowns — patterns and issues you discover by running clustering and topic modeling over production traces. A complete agent observability strategy must address both.
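The unknown-unknowns branch can be illustrated with a small clustering pass over user messages. This is only a sketch: a bag-of-words vector stands in for a real LLM embedding, the greedy grouping stands in for a proper clustering algorithm, and the `0.5` threshold is an arbitrary assumption. The point is the shape of the workflow (embed, group, inspect the groups), not the specific technique.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts: list, threshold: float = 0.5) -> list:
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_vector, member_texts)
    for text in texts:
        vec = embed(text)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(text)
                break
        else:
            clusters.append((vec, [text]))
    return [members for _, members in clusters]


messages = [
    "how do I roll over my 401k",
    "can I roll over my 401k early",
    "what is the fee on this fund",
    "what is the fee on that fund",
]
for group in cluster(messages):
    print(len(group), group)
```

Each resulting group is a candidate topic for a domain expert to review; groups that turn out to be recurring failure modes can then be promoted to explicit automated scores.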
Human Annotation as a Scaling Seed
Human annotation — having domain experts grade agent traces and justify their grades — is a critical step. Those justifications become the raw material for building scalable, automated scoring functions. You find the failure modes through human review first, then systematize them.
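A small data model can enforce the "grade plus justification" rule at the point of capture. The sketch below is illustrative: the `Annotation` class, the 1 to 5 grade scale, and `failure_mode_notes` are assumptions made for the example, not part of the framework itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A domain expert's review of one trace (hypothetical record shape)."""
    trace_id: str
    reviewer_role: str   # e.g. "clinician"
    grade: int           # assumed scale: 1 (poor) to 5 (excellent)
    justification: str

    def __post_init__(self) -> None:
        if not 1 <= self.grade <= 5:
            raise ValueError("grade must be between 1 and 5")
        if not self.justification.strip():
            raise ValueError("a written justification is required")


def failure_mode_notes(annotations, max_grade: int = 2) -> list:
    """Collect justifications for low grades: raw material for new scorers."""
    return [a.justification for a in annotations if a.grade <= max_grade]


reviews = [
    Annotation("t-1", "clinician", 2, "Recommended a pathway not in the retrieved record."),
    Annotation("t-2", "clinician", 5, "Accurate and used approved language."),
]
print(failure_mode_notes(reviews))
```

Rejecting an empty justification at construction time means a grade can never be stored without the explanation that makes it reusable.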
// How do you apply the Hetzel Agent Observability Differentiation Framework step by step?
- 1
Classify the system's determinism profile
Ask: does this system follow known control flow with deterministic code paths, or does it use an LLM where the reasoning path is variable? If deterministic, traditional observability may be sufficient. If non-deterministic (agent), proceed to step 2.
- 2
Audit the scope of what needs to be measured
Separate technical metrics (latency, duration, time to first token, total tokens, error counts, cache hits) from functional/qualitative metrics (groundedness in retrieved context, correct tool usage, alignment to brand standard in system prompt, response quality). Traditional tools cover the former; agent observability tooling is required for the latter.
- 3
Assess the trace data characteristics
Determine expected trace size and structure. If traces are semi-structured, contain large volumes of unstructured text, or could exceed megabytes per span, flag this as an 'agent traces are nasty' scenario. Traditional observability databases (including OLAP tools like ClickHouse) are likely insufficient unless augmented with full-text indexing and write-ahead log ingestion.
- 4
Map the required read patterns
Identify whether consumers need: (a) real-time streaming visibility as interactions happen, (b) SQL/analytical queries over historical traces for experimentation, or (c) full-text search across trace content (e.g., 'show me every trace that mentioned the word X'). If all three are needed, a purpose-built agent trace database or stack is required, not a standard observability backend.
- 5
Identify all stakeholder personas who need access
List every role that must consume or act on observability data. If the list includes only systems/product engineers, traditional tooling personas apply. If it includes domain experts (clinicians, lawyers, wealth advisors, compliance officers), the platform must support natural language interaction and trace review by non-technical users.
- 6
Design the human annotation workflow
Establish a process where domain experts review production traces, assign quality grades, and — critically — provide written justifications for their grades. These justifications are the seed for building automated, scalable scoring functions. Do not skip justifications; grades alone are insufficient.
- 7
Separate known unknowns from unknown unknowns
For known unknowns: define explicit automated scores tied to anticipated failure modes (groundedness checks, tool-use checks, brand alignment checks). For unknown unknowns: implement lightweight LLM-based embedding and clustering over production traces to surface emergent topics, user intent patterns, sentiment signals, and unexpected failure modes you did not anticipate.
- 8
Close the iteration loop between production and experimentation
Once issues surface in production observability traces, the workflow should make it fast and direct to: (a) add those traces to an offline dataset, (b) run evals against them in batch, and (c) experiment with agent changes. The goal is to compress the time between 'problem seen in production' and 'fix validated in experimentation'.
- 9
Confirm whether traditional observability tools are still needed in parallel
Traditional tools like Datadog or Grafana are still valid and useful for the technical layer — 400/500 errors, uptime, website-level performance. Do not replace them; layer agent observability on top for functional quality. The two are complementary, not competitive, at this layer.
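The diagnostic steps above can be condensed into a rough decision function. Everything in this sketch is an assumption for illustration: the `SystemProfile` fields, the 1 MB span threshold, and the wording of the recommendations are invented here, and steps 6 to 8 (annotation, clustering, iteration loop) are process steps that do not reduce to a simple check.

```python
from dataclasses import dataclass, field


@dataclass
class SystemProfile:
    """Answers gathered while walking the steps (hypothetical field names)."""
    uses_llm: bool                   # step 1: is the reasoning path variable?
    needs_functional_metrics: bool   # step 2: groundedness, tool use, brand alignment
    max_span_mb: float               # step 3: largest expected span payload
    read_patterns: set = field(default_factory=set)  # step 4: "realtime", "sql", "fulltext"
    personas: set = field(default_factory=set)       # step 5: e.g. {"engineer", "clinician"}


ALL_READ_PATTERNS = {"realtime", "sql", "fulltext"}


def recommend(profile: SystemProfile) -> list:
    # Step 9: the technical layer is always covered by traditional tooling.
    recs = ["traditional observability for uptime, latency, and 400/500 errors"]
    if not profile.uses_llm:
        return recs  # deterministic system: traditional tooling may suffice
    if profile.needs_functional_metrics:
        recs.append("agent observability for functional scoring")
    # Assumed threshold: treat spans of 1 MB or more as "nasty" traces.
    if profile.max_span_mb >= 1 or ALL_READ_PATTERNS <= profile.read_patterns:
        recs.append("purpose-built trace storage with full-text indexing")
    if profile.personas - {"engineer"}:
        recs.append("trace review and annotation workflow for domain experts")
    return recs


triage_bot = SystemProfile(
    uses_llm=True,
    needs_functional_metrics=True,
    max_span_mb=20,
    read_patterns={"realtime", "fulltext"},
    personas={"engineer", "clinician"},
)
for line in recommend(triage_bot):
    print("-", line)
```

Note that the traditional-tooling recommendation is never dropped, which mirrors step 9: agent observability is layered on top rather than substituted.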
// What does the Hetzel framework look like in real-world agent observability scenarios?
A healthcare company has built an LLM-based triage assistant and their existing DevOps team wants to monitor it using their current Datadog contract.
Apply the Scope Difference Principle: Datadog will capture latency, error rates, and uptime (technical observability) but cannot evaluate whether the assistant's responses were grounded in the patient's retrieved records, whether it recommended appropriate clinical pathways, or whether it deviated from approved clinical language. Invoke the Dual Persona Requirement: registered nurses and clinicians must be included in the observability workflow to grade traces and provide justifications. Use the Human Annotation as a Scaling Seed principle to turn their justifications into automated scoring functions for clinical accuracy and protocol adherence. Flag Agent Traces Are Nasty: trace payloads containing full conversation context and retrieved medical documents will be large and semi-structured, requiring a purpose-built ingestion and indexing strategy.
A fintech team is building a wealth management agent and wants to understand how users are actually using it in production — they don't know what questions to ask yet.
This is an unknown unknowns problem. Traditional observability will show you uptime and error rates but will not reveal user intent, sentiment, or emergent usage patterns. Apply the unknown unknowns branch of the Known Unknowns vs. Unknown Unknowns principle: run lightweight LLM-based embedding and clustering over production traces to surface topic clusters, identify what users are actually asking, detect sentiment, and find failure modes that were not anticipated at build time. Wealth advisors should be included as annotators per the Dual Persona Requirement to validate whether the clustered topics represent real quality issues.
// What mistakes should you avoid when implementing agent observability with this framework?
- Assuming that an existing Grafana or Datadog contract solves the agent observability problem — it handles technical observability only, not functional observability.
- Excluding domain experts (clinicians, lawyers, wealth advisors) from the observability workflow because it 'feels too technical' — this is the exact population that can evaluate agent quality and their participation is what separates good teams from average ones.
- Collecting human annotation grades without capturing written justifications — the justifications are the mechanism by which you build scalable automated scoring; grades alone do not transfer.
- Underestimating the data infrastructure requirements of agent traces — treating them like standard log or metric data will cause ingestion, query performance, and real-time visibility failures at scale.
- Ignoring the full-text search requirement across trace content — without text-based indexing, you cannot answer basic operational questions like 'show me every trace where the agent mentioned a specific topic or entity'.
- Treating observability and evals as separate systems requiring separate infrastructure. They are the same problem solved by the same system; the only differences are batch versus real-time processing and known versus unknown inputs.
- Treating agent observability as purely a technical persona problem — if only engineers are looking at traces, you are missing the qualitative signal that domain experts can provide.
// What key terms should you know for the Hetzel Agent Observability Differentiation Framework?
- Agent Observability
- The practice of monitoring, measuring, and improving the quality and behavior of AI agent systems in production — encompassing both technical metrics and functional/qualitative properties of agent outputs.
- Traditional Observability
- Established monitoring practice focused exclusively on uptime and technical performance — latency, error counts (400/500 level), duration — using tools like Grafana and Datadog. Answers the question: is the system operational?
- Functional Observability
- The qualitative layer of agent observability: was the agent's response grounded in retrieved context, did it use the expected tools, was it aligned to the brand standard set in the system prompt? Requires purpose-built agent tooling.
- Technical Observability
- The metrics-level layer of observability applicable to both traditional and agent systems: latency, time to first token, total tokens, duration, cache hits, error rates. Comes 'on the house' when an agent is properly traced.
- Agent Traces Are Nasty
- The characterization of agent trace data as highly semi-structured, containing large volumes of unstructured text, potentially exceeding a gigabyte per trace or 20 megabytes per span, requiring specialized database infrastructure to ingest, index, and query in real time.
- Non-Deterministic
- The property of agent/LLM systems where the reasoning path and output vary across identical or similar inputs, because a language model rather than fixed code chooses each step. Contrast with the deterministic, known control flow of traditional applications.
- Known Unknowns
- Anticipated failure modes in agent behavior that can be defined in advance and measured with automated scoring functions (e.g., groundedness checks, tool-use verification).
- Unknown Unknowns
- Emergent patterns, failure modes, and usage behaviors in production agent traces that were not anticipated at build time — surfaced through LLM-based embedding, clustering, and topic modeling over production traces.
- Human Annotation
- The workflow of having domain experts (not just engineers) review production agent traces, assign quality grades, and write justifications for those grades — which then seed the development of scalable automated scoring functions.
- Dual Persona Requirement
- The principle that effective agent observability requires both technical personnel and non-technical domain experts (clinicians, lawyers, wealth advisors, etc.) to participate, because domain experts are closest to the users or problem space and can evaluate qualitative agent quality.
- Iteration Loop
- The cycle between detecting a problem in production observability traces and validating a fix through offline experimentation/evals — the goal of agent observability infrastructure is to make this loop faster and more direct.
- Write-Ahead Log
- A database mechanism that appends incoming writes to a durable log before they are indexed. In agent trace infrastructure, it persists incoming trace data immediately so users can see interactions in real time as they occur.
- Tantivy Index
- A full-text index built on a fork of Tantivy, an open-source Rust search library similar to Apache Lucene. Agent observability databases need it to support text-based search queries across unstructured content within traces.
- Trace
- A complete record of a full agent interaction or workflow, from start to finish.
- Span
- A single step within a trace — for example, one model call or one tool call within a larger agent interaction.
// Frequently asked questions
What is the Hetzel Agent Observability Differentiation Framework?
The Hetzel Agent Observability Differentiation Framework is a diagnostic method for determining whether an AI agent system requires traditional observability tooling, agent-specific observability, or both. It separates technical metrics (latency, errors, uptime) from functional metrics (groundedness, tool-use correctness, brand alignment) and identifies which stakeholder personas — engineers and domain experts — need access to each observability layer.
What is agent observability and how is it different from traditional monitoring?
Agent observability is the practice of monitoring both the technical performance and the qualitative behavior of AI agent systems in production. Traditional monitoring focuses exclusively on uptime, latency, and error rates. Agent observability adds a functional layer: was the agent's output grounded in retrieved context, did it use the right tools, and was the response aligned to the brand standard? Traditional tools like Datadog cannot evaluate these semantic and reasoning-level properties.
How do I decide if my AI agent needs special observability tooling?
Start by classifying whether your system follows deterministic code paths or uses an LLM with variable reasoning. If it's deterministic, traditional observability may suffice. If it involves an LLM agent, you need purpose-built agent observability for functional metrics like groundedness and tool-use correctness. Then assess trace data size — agent traces can exceed a gigabyte each — and identify whether non-technical domain experts need to review outputs.
How do I set up observability for an LLM-based agent system?
First, layer traditional tools (Datadog, Grafana) for technical metrics like latency and error rates. Then add agent-specific tooling for functional observability — groundedness checks, tool-use verification, and brand alignment scoring. Implement a human annotation workflow where domain experts grade traces with written justifications. Use those justifications to build automated scoring functions. Finally, add clustering and topic modeling over production traces to surface unknown failure modes.
How does the Hetzel framework compare to just using Datadog or Grafana for AI agents?
Datadog and Grafana handle technical observability — uptime, latency, 400/500 errors — but cannot evaluate the qualitative properties of agent outputs. The Hetzel framework doesn't replace these tools; it layers agent-specific functional observability on top. It adds groundedness checks, tool-use verification, brand alignment scoring, full-text search across trace content, and a human annotation workflow with domain experts. The two approaches are complementary, not competitive.
When should I use the Hetzel Agent Observability Differentiation Framework?
Use it whenever you are designing, auditing, or advising on the observability strategy for any AI agent system. It's especially useful when stakeholders ask why existing tools like Grafana or Datadog feel insufficient for agent workloads, when you're building a new agent and need to plan the monitoring stack, or when domain experts (clinicians, lawyers, wealth advisors) need to participate in quality evaluation of agent outputs.
What results can I expect after applying the Hetzel framework to my agent system?
You'll have a clear, layered observability architecture that covers both technical and functional quality. Engineers get latency, error rates, and uptime dashboards. Domain experts get trace review workflows and quality scoring. You'll surface both anticipated failure modes (known unknowns) and emergent issues (unknown unknowns). The iteration loop between detecting problems in production and validating fixes in experimentation becomes dramatically shorter.
Why do domain experts need to be involved in agent observability?
Domain experts — clinicians, lawyers, wealth advisors — are closest to the users and the problem space. They can evaluate qualitative agent quality that engineers cannot: whether clinical recommendations are appropriate, legal advice is accurate, or financial guidance is suitable. Their graded trace reviews with written justifications become the raw material for building scalable automated scoring functions. Excluding them means missing the most important quality signals.
What are known unknowns vs unknown unknowns in agent observability?
Known unknowns are failure modes you anticipate in advance — like checking whether the agent grounded its response in retrieved context or used the correct tools. You build automated scoring functions for these. Unknown unknowns are emergent patterns you didn't predict — discovered by running LLM-based clustering and topic modeling over production traces to surface unexpected user intent, sentiment, and failure modes.
Why are agent traces harder to handle than normal application logs?
Agent traces are highly semi-structured and contain massive volumes of unstructured text. A single trace can exceed a gigabyte, with individual spans reaching 20 megabytes. They require full-text indexing, write-ahead log ingestion for real-time visibility, and support for three distinct read patterns: real-time streaming, SQL analytical queries, and full-text search. Traditional observability databases were not designed for this combination of requirements.
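To make the full-text read pattern concrete, here is a toy inverted index that answers "which traces mentioned word X". It is a sketch only: the `TraceTextIndex` class is hypothetical, it holds everything in memory, and it ignores the size, durability, and real-time ingestion concerns that make the production problem hard.

```python
from collections import defaultdict


class TraceTextIndex:
    """Toy inverted index: word -> ids of traces whose span text contains it."""

    PUNCTUATION = ".,!?\"'()"

    def __init__(self) -> None:
        self._postings = defaultdict(set)

    def add_span(self, trace_id: str, text: str) -> None:
        """Index every word of one span's text under its trace id."""
        for raw in text.lower().split():
            word = raw.strip(self.PUNCTUATION)
            if word:
                self._postings[word].add(trace_id)

    def search(self, word: str) -> set:
        """Return ids of all traces that mentioned the word (case-insensitive)."""
        return set(self._postings.get(word.lower(), set()))


index = TraceTextIndex()
index.add_span("trace-1", "Patient asked about a Metformin refill.")
index.add_span("trace-2", "Booked a follow-up appointment.")
index.add_span("trace-3", "Explained metformin side effects.")
print(sorted(index.search("metformin")))
```

A plain row scan can answer the same question, but its cost grows with total trace volume, which is why the text above treats indexing as a requirement rather than an optimization.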
Are observability and evals the same thing for AI agents?
Yes — observability and evals are the same underlying problem solved by the same system. The only difference is timing and input knowledge. With evals, you know the inputs ahead of time and run scoring functions in batch mode. With observability, inputs are unknown and arrive in real time from production. Building separate infrastructure for each is a common and costly mistake. A unified system handles both efficiently.