Frequently Asked Questions About the Hetzel Agent Observability Differentiation Framework

22 answers covering everything from basics to advanced usage.

// Basics

What is functional observability for AI agents?

Functional observability is the qualitative layer of agent monitoring that evaluates whether the agent's response was grounded in retrieved context, whether it used the expected tools correctly, and whether it aligned with the brand standard defined in the system prompt. It requires purpose-built agent tooling because traditional observability platforms like Datadog and Grafana cannot perform semantic or reasoning-level evaluations of agent outputs.

What is the dual persona requirement in agent observability?

The dual persona requirement is the principle that effective agent observability must serve both technical personnel (engineers, SREs) and non-technical domain experts (clinicians, lawyers, wealth advisors). Engineers need latency dashboards and error monitoring. Domain experts need trace review interfaces where they can evaluate qualitative agent quality in natural language. Excluding either group leaves critical blind spots in your observability coverage.

What is a write-ahead log in the context of agent traces?

A write-ahead log (WAL) is a database mechanism that appends incoming data to a durable log before full processing completes. Applied to agent traces, it persists each record as soon as it arrives, which enables real-time visibility into agent interactions as they happen. This is critical for agent observability because teams need to see live interactions rather than wait for batch processing, especially when debugging production issues or monitoring high-stakes agent deployments in domains like healthcare or finance.
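
The append-then-read idea can be illustrated with a minimal sketch. This is not how any particular trace database implements its WAL; the class name `TraceWAL`, the JSON-lines file layout, and the record fields are all assumptions made for illustration. Each record is flushed to disk before `append` returns, so a reader can see it immediately, before any indexing would run.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterator, List


class TraceWAL:
    """Hypothetical append-only log: one JSON record per line."""

    def __init__(self, path: str) -> None:
        self.path = path

    def append(self, record: Dict[str, Any]) -> None:
        # Write the record and force it to disk before returning,
        # so it is durable before any downstream processing starts.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def read_from(self, index: int = 0) -> Iterator[Dict[str, Any]]:
        # Yield records starting at a record index (not a byte offset).
        if not os.path.exists(self.path):
            return
        with open(self.path, "r", encoding="utf-8") as f:
            for i, line in enumerate(f):
                if i >= index:
                    yield json.loads(line)


def demo_wal() -> List[Dict[str, Any]]:
    # Append two spans, then read back everything after the first one.
    with tempfile.TemporaryDirectory() as tmp:
        wal = TraceWAL(os.path.join(tmp, "traces.wal"))
        wal.append({"trace_id": "t1", "span_id": "s1", "kind": "llm"})
        wal.append({"trace_id": "t1", "span_id": "s2", "kind": "tool"})
        return list(wal.read_from(1))
```

A live viewer would poll `read_from` with the index of the last record it has seen; a production system would also need log rotation and crash recovery, which this sketch leaves out.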

Do I need agent observability if my LLM application is just a simple chatbot?

Yes. If your chatbot uses an LLM, it is non-deterministic and produces variable outputs. Even simple chatbots can produce ungrounded responses, deviate from brand tone, or surface unexpected failure modes. The scope of agent observability you need may be smaller (fewer automated scores, simpler annotation workflows), but the fundamental requirement for functional observability beyond latency and error rates still applies. Start with the framework's classification step to confirm.

What's the difference between a trace and a span in agent observability?

A trace is a complete record of a full agent interaction or workflow from start to finish: the entire conversation or task completion. A span is a single step within that trace, for example one LLM call, one tool invocation, or one retrieval operation. Agent traces typically contain many spans, and individual spans can be very large (up to 20 megabytes) when they include full conversation context or retrieved documents.
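
As a rough sketch of the relationship, a trace can be modeled as a container of spans. The class and field names below (`Span`, `Trace`, `kind`, `payload`) are hypothetical and not taken from any specific tracing library; size is approximated as the UTF-8 length of the JSON-encoded payload.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Span:
    """One step inside a trace: an LLM call, tool invocation, or retrieval."""
    span_id: str
    kind: str  # assumed labels such as "llm", "tool", "retrieval"
    payload: Dict[str, Any] = field(default_factory=dict)

    def size_bytes(self) -> int:
        # Approximate the span size by its JSON-encoded payload.
        return len(json.dumps(self.payload).encode("utf-8"))


@dataclass
class Trace:
    """A full interaction from start to finish, made of many spans."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)

    def size_bytes(self) -> int:
        return sum(span.size_bytes() for span in self.spans)

    def largest_span(self) -> Span:
        # Raises ValueError if the trace has no spans.
        return max(self.spans, key=lambda span: span.size_bytes())
```

Calling `largest_span()` on a trace is a quick way to find the step that carries the bulk of the payload, which is usually a retrieval or a long-context LLM call.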

// How To

How do I build automated scoring functions from human annotations?

Start by having domain experts review production agent traces, assign quality grades, and write detailed justifications for each grade. The justifications — not just the grades — are the critical raw material. Analyze the justifications to identify recurring failure patterns and quality criteria. Then codify those patterns into automated scoring functions (often LLM-as-judge evaluators) that can run at scale. Iterate by comparing automated scores against continued human annotations to calibrate accuracy.
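
The calibration step at the end can be sketched as a comparison of labels on the same traces. The function name `calibration_report` and the pass/fail labels are assumptions for illustration; the framework does not prescribe a specific metric. The sketch reports the raw agreement rate and which (human, automated) disagreements occur most often, which points at the criteria the automated scorer is still missing.

```python
from collections import Counter
from typing import Any, Dict, List


def calibration_report(human: List[str], automated: List[str]) -> Dict[str, Any]:
    """Compare automated labels against human labels for the same traces."""
    if len(human) != len(automated) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, automated))
    # Fraction of traces where the scorer matches the domain expert.
    agreement = sum(h == a for h, a in pairs) / len(pairs)
    # Count each (human, automated) mismatch to see where the scorer drifts.
    disagreements = Counter((h, a) for h, a in pairs if h != a)
    return {"agreement": agreement, "disagreements": disagreements.most_common()}
```

A frequent `("fail", "pass")` pair, for instance, would suggest the scorer is too lenient on a failure pattern the experts described in their justifications.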

How do I implement unknown-unknown detection for agent traces?

Run a lightweight embedding model over production traces to generate a vector representation of each interaction. Apply a clustering algorithm to group similar traces together, then use topic modeling to label each cluster. This surfaces emergent patterns like unexpected user intents, recurring failure modes, sentiment shifts, and usage behaviors you didn't anticipate at build time. Review the clusters with domain experts to validate which represent genuine quality issues versus benign variation.
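
The embed, cluster, label pipeline can be illustrated with a deliberately simplified, standard-library-only sketch. It swaps in a bag-of-words count vector for the embedding model, a greedy similarity threshold for the clustering algorithm, and top terms for topic modeling; all function names, the stopword list, and the `0.3` threshold are assumptions. A real deployment would use an actual embedding model and a proper clustering library.

```python
import math
import re
from collections import Counter
from typing import List

STOPWORDS = {"the", "a", "my", "i", "to", "is", "for", "of", "and", "how", "do", "can"}


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words counts."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_traces(texts: List[str], threshold: float = 0.3) -> List[List[int]]:
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    vectors = [embed(t) for t in texts]
    clusters: List[List[int]] = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found: start a new one
    return clusters


def label_cluster(texts: List[str], members: List[int], top_n: int = 2) -> List[str]:
    """Label a cluster by its most frequent non-stopword terms."""
    counts: Counter = Counter()
    for i in members:
        counts.update(embed(texts[i]))
    return [term for term, _ in counts.most_common(top_n)]
```

The output is a list of index groups plus a few label terms per group, which is the shape a domain expert would review when deciding whether a cluster is a real quality issue.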

How do I set up a human annotation workflow for agent observability?

Create a trace review interface accessible to non-technical domain experts. Sample production traces — both randomly and targeting edge cases. Have domain experts assign quality grades on a defined scale and, critically, write justifications for every grade. Store justifications as structured data. Run regular annotation sessions (weekly or biweekly) and use inter-annotator agreement to measure consistency. Feed justifications into automated scoring function development.
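
One common way to quantify inter-annotator agreement between two reviewers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The source does not name a specific statistic, so treat this choice as an assumption; the function below is a minimal sketch for two annotators grading the same traces.

```python
from collections import Counter
from typing import List


def cohens_kappa(rater_a: List[str], rater_b: List[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical grades.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater graded independently at their own rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)
```

Values near 1 indicate consistent grading; values near 0 suggest the grading scale or guidelines need tightening before the justifications are used to build automated scorers.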

How do I map read patterns for my agent observability database?

Identify three distinct read patterns: (1) real-time streaming — watching agent interactions as they happen, (2) SQL/analytical queries — running historical analysis and experimentation over past traces, and (3) full-text search — querying across unstructured text content within traces (e.g., 'show me every trace mentioning a specific drug name'). If your system requires all three, standard OLAP databases like ClickHouse or traditional observability backends will be insufficient without augmentation.

// Troubleshooting

My agent traces are too large and crashing our existing observability pipeline — what's going wrong?

Agent traces are fundamentally different from traditional application traces. A single agent trace can exceed a gigabyte, with individual spans reaching 20 megabytes due to embedded conversation context, retrieved documents, and tool outputs. Traditional observability pipelines were designed for small, structured metric and log data. You need a purpose-built agent trace database with write-ahead log ingestion, full-text indexing (like a Tantivy index), and infrastructure designed for semi-structured, text-heavy payloads.

Why can't my team find specific topics across agent traces using our current tools?

Traditional observability tools index structured metrics and tags, not unstructured text content within traces. To answer questions like 'show me every trace where the agent mentioned a specific entity or topic,' you need full-text indexing — similar to how search engines index web pages. This requires a Tantivy-style index or equivalent full-text search infrastructure layered onto your trace storage. Without it, basic operational questions about agent behavior remain unanswerable.
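
To show what full-text indexing adds, here is a minimal in-memory inverted index. It is not Tantivy and does not use its API; the class name `TraceTextIndex`, the tokenizer, and the AND-only query semantics are assumptions chosen to keep the sketch short. The core idea is the same: map each term to the set of traces that contain it, so a text query becomes a set lookup rather than a scan.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


class TraceTextIndex:
    """Hypothetical inverted index: term -> IDs of traces containing it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    @staticmethod
    def _tokenize(text: str) -> List[str]:
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, trace_id: str, text: str) -> None:
        for token in self._tokenize(text):
            self._postings[token].add(trace_id)

    def search(self, query: str) -> List[str]:
        """Return sorted trace IDs that contain every query term."""
        terms = self._tokenize(query)
        if not terms:
            return []
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return sorted(result)
```

For example, after indexing three traces, `search("metformin")` returns only the traces whose text mentions that drug name, regardless of which span the mention appeared in.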

Our domain experts say the observability dashboards are too technical — how do we fix this?

This is a dual persona requirement failure. Domain experts (clinicians, lawyers, wealth advisors) need interfaces designed for natural language interaction with trace data — not Grafana dashboards with latency percentiles. Build or adopt tooling that lets non-technical users browse conversation traces, search by topic in natural language, and submit quality grades with written justifications. Separate the domain expert interface from the engineering dashboard entirely.

// Comparisons

How does the Hetzel framework compare to building custom observability from scratch?

The Hetzel framework gives you a structured diagnostic process so you don't reinvent the wheel or miss critical requirements. Building from scratch without a framework typically leads to gaps: teams over-invest in technical metrics while ignoring functional observability, exclude domain experts, underestimate trace data infrastructure needs, or build separate systems for observability and evals when one unified system would suffice. The framework ensures you address all layers systematically before writing code.

How does agent observability compare to traditional APM tools like New Relic or Datadog?

Traditional APM tools measure technical health: latency, throughput, error rates, and uptime. They work for deterministic applications with known code paths. Agent observability adds a qualitative layer that APM tools cannot provide: evaluating whether agent outputs are grounded, aligned to brand standards, and using tools correctly. APM tools remain valuable for the technical layer of agent systems — the Hetzel framework recommends keeping them and layering agent-specific tooling on top.

Is agent observability the same as LLM evaluation?

They address the same underlying problem and are best served by the same system. The difference is the input: evals run known inputs in batch to validate agent behavior before deployment, while observability processes unknown, real-time inputs from production. The scoring functions, trace infrastructure, and quality criteria are identical. The Hetzel framework explicitly warns against building separate infrastructure for observability and evals, because this creates redundancy and slows the iteration loop between detecting production issues and validating fixes.

// Advanced

How does this framework handle hybrid systems that are part deterministic and part agent-based?

The framework's first step is classifying the system's determinism profile. For hybrid systems, you apply traditional observability to the deterministic components and agent-specific observability to the LLM-based components. The two layers run in parallel. Technical metrics (latency, errors) apply to both. Functional metrics (groundedness, tool-use correctness, brand alignment) apply only to the agent components. The key is not conflating them or assuming one tooling approach covers both.
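
The split between the two metric layers can be sketched as a simple routing rule. The span shape, the `component` field, and the metric names below are hypothetical; the point is only that technical metrics are computed for every span while functional metrics are attached to agent spans alone.

```python
from typing import Any, Dict

# Hypothetical span shape for illustration; the field names are assumptions.
Span = Dict[str, Any]


def technical_metrics(span: Span) -> Dict[str, Any]:
    """Metrics that apply to every component, deterministic or agent-based."""
    return {"latency_ms": span["latency_ms"], "error": span["error"]}


def functional_metrics(span: Span) -> Dict[str, Any]:
    """Quality scores that only make sense for LLM-based components."""
    return {
        "groundedness": span.get("groundedness"),
        "tool_use_correct": span.get("tool_use_correct"),
        "brand_aligned": span.get("brand_aligned"),
    }


def evaluate_span(span: Span) -> Dict[str, Any]:
    metrics = technical_metrics(span)
    # Only agent components receive the functional layer.
    if span["component"] == "agent":
        metrics.update(functional_metrics(span))
    return metrics
```

Keeping the rule explicit in code makes it harder to accidentally report a missing groundedness score for a deterministic component as a quality failure.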

What database infrastructure do I need specifically for agent traces?

Agent traces require a database that handles three read patterns simultaneously: real-time streaming via write-ahead log ingestion, SQL/analytical queries for historical analysis, and full-text search via a Tantivy-style index or equivalent. Standard OLAP databases like ClickHouse handle analytical queries well but are not built primarily for full-text search over large text payloads or for real-time streaming reads. Traditional observability backends handle streaming but not full-text search. You need purpose-built or augmented infrastructure addressing all three.

How do I scale human annotation as agent interaction volume grows?

Human annotation doesn't need to cover every trace — it seeds automated scoring functions. Start with targeted sampling: random samples plus edge cases flagged by automated scores or clustering. As domain experts annotate and provide written justifications, build automated LLM-as-judge scorers that replicate their evaluation criteria. Gradually shift from high human annotation rates to primarily automated scoring with periodic human calibration. The justifications, not just grades, are what make this scaling possible.
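
Targeted sampling can be sketched as "all flagged traces plus a random sample of the rest". The function name, the `auto_score` field, and the `0.5` threshold are assumptions for illustration; flagging could equally come from clustering output.

```python
import random
from typing import Any, Dict, List


def select_for_annotation(
    traces: List[Dict[str, Any]],
    random_k: int,
    low_score_threshold: float = 0.5,
    seed: int = 0,
) -> List[str]:
    """Pick trace IDs: every flagged edge case plus a random sample of the rest."""
    flagged = [t["trace_id"] for t in traces if t["auto_score"] < low_score_threshold]
    rest = [t["trace_id"] for t in traces if t["auto_score"] >= low_score_threshold]
    rng = random.Random(seed)  # seeded so a sampling run can be reproduced
    sampled = rng.sample(rest, min(random_k, len(rest)))
    return flagged + sampled
```

As automated scorers improve, `random_k` can shrink while the flagged portion keeps human attention on the traces the scorers are least sure about.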

Can I use the Hetzel framework for multi-agent systems?

Yes. Multi-agent systems amplify every challenge the framework addresses. Traces become larger and more complex with inter-agent communication spans. Non-determinism compounds as multiple LLMs interact. Functional observability must evaluate not just individual agent quality but coordination quality — did agents hand off correctly, did they use each other's outputs appropriately? Apply the framework to each agent individually and to the orchestration layer as a whole.

What happens if I skip the human annotation step?

Without human annotation, you cannot build reliable automated scoring functions for functional quality. You'll be limited to technical metrics and whatever automated checks you can define from first principles — which will miss domain-specific failure modes. The Hetzel framework's Human Annotation as a Scaling Seed principle exists because the most important quality criteria are discovered empirically by domain experts reviewing real production traces, not predicted theoretically by engineers.

How often should domain experts review agent traces?

Establish a regular cadence — weekly or biweekly annotation sessions are a practical starting point. In early production, increase frequency to surface failure modes quickly. As automated scoring functions mature and cover more known failure modes, shift human review toward validating automated scores, calibrating inter-annotator agreement, and investigating unknown unknowns surfaced by clustering. Never fully eliminate human review; it remains the calibration mechanism for your entire quality system.

What does the iteration loop between production and experimentation actually look like?

When a problem surfaces in production traces — through automated scores, human annotation, or clustering — the workflow should let you quickly add those problematic traces to an offline dataset, run evals against them in batch mode with different agent configurations, and validate whether proposed changes fix the issue. The goal is compressing the time between 'problem detected in production' and 'fix validated in experimentation' to hours rather than weeks. This requires unified infrastructure for both observability and evals.
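
A minimal sketch of that loop, assuming agent configurations can be represented as plain callables: collect the inputs from the problematic traces into a dataset, run each candidate configuration over it in batch, and compare mean scores. The type aliases `AgentFn` and `ScoreFn` and the function `run_batch_eval` are hypothetical names; the scorer would be the same function used for production scoring.

```python
from typing import Callable, Dict, List

AgentFn = Callable[[str], str]          # hypothetical: user input -> agent output
ScoreFn = Callable[[str, str], float]   # hypothetical: (input, output) -> score in [0, 1]


def run_batch_eval(
    dataset: List[str],
    configs: Dict[str, AgentFn],
    scorer: ScoreFn,
) -> Dict[str, float]:
    """Return the mean score of each configuration over the dataset."""
    if not dataset:
        raise ValueError("dataset must contain at least one input")
    results: Dict[str, float] = {}
    for name, agent in configs.items():
        scores = [scorer(user_input, agent(user_input)) for user_input in dataset]
        results[name] = sum(scores) / len(scores)
    return results
```

If the candidate configuration scores higher than the baseline on the traces that failed in production, the fix has at least been validated against the cases that motivated it.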