How Do Platform Engineers Build Agent Observability?
For platform engineers at AI-native startups · Based on the Hetzel Agent Observability Differentiation Framework
// TL;DR
Platform engineers building AI agent infrastructure need more than Datadog and Grafana. The Hetzel Agent Observability Differentiation Framework helps you design an observability stack that handles semi-structured traces exceeding a gigabyte each, supports three read patterns (real-time streaming, SQL analytics, full-text search), and serves both engineering and non-technical domain expert personas. Use it to avoid building separate systems for observability and evals, and to ensure your trace database can handle the unique demands of LLM agent workloads.
Why Can't I Just Use Datadog for My Agent System?
Datadog, Grafana, and other traditional APM tools are excellent at what they were designed for: technical observability. Latency percentiles, error rates, uptime dashboards, throughput metrics — these remain critical for your agent system and you should keep using these tools for that layer.
But agent systems introduce a fundamentally new category of observability requirements that traditional tools cannot address. The Hetzel framework calls this functional observability: evaluating whether the agent's output was grounded in retrieved context, whether it used the correct tools, whether it adhered to the brand standard defined in the system prompt, and whether the response quality meets domain-specific criteria.
As a platform engineer, your job is to build infrastructure that serves both layers. The Hetzel framework's Scope Difference Principle makes this explicit: traditional observability answers "is the system up?" while agent observability answers "is the system producing quality outputs?"
What Makes Agent Trace Data So Different From Normal Logs?
This is where platform engineers feel the pain most directly. The Hetzel framework's Agent Traces Are Nasty principle describes the core challenge: agent traces are highly semi-structured, contain massive volumes of unstructured text, can exceed a gigabyte per trace with individual spans reaching 20 megabytes, and must be delivered in true real time.
Your existing observability pipeline — whether it's built on ClickHouse, Elasticsearch, or a managed service — was designed for structured metrics and moderate-sized log entries. Agent traces break these assumptions in three ways:
1. Volume per trace: Embedding full conversation context, retrieved documents, and tool outputs creates payloads orders of magnitude larger than traditional spans.
2. Three read patterns: Your consumers need real-time streaming (watching interactions live), SQL analytical queries (experimentation over historical traces), and full-text search (finding every trace that mentions a specific entity). No single traditional backend handles all three.
3. Write-ahead log ingestion: Domain experts and engineers need to see traces as they happen, not after a batch processing delay.
Design your trace database around these three read patterns from day one. Bolting on full-text search later is architecturally painful.
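To make the three read patterns concrete, here is a minimal single-process sketch in Python. It is an illustration under stated assumptions, not a production design: the TraceStore class, its method names, and the span fields (trace_id, span_id, tool, latency_ms, text) are all hypothetical, SQLite stands in for the analytical engine, and a plain dictionary stands in for a Tantivy-style inverted index.

```python
import sqlite3
from collections import defaultdict
from typing import Callable


class TraceStore:
    """Toy store exposing streaming, SQL, and full-text reads (hypothetical API)."""

    def __init__(self) -> None:
        self._subscribers: list[Callable[[dict], None]] = []
        # SQLite stands in for the analytical engine in this sketch.
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE spans (trace_id TEXT, span_id TEXT, tool TEXT, "
            "latency_ms REAL, text TEXT)"
        )
        # token -> set of trace ids; a stand-in for a real inverted index.
        self._index: dict[str, set[str]] = defaultdict(set)

    def subscribe(self, callback: Callable[[dict], None]) -> None:
        """Register a live consumer; it receives every span as it is written."""
        self._subscribers.append(callback)

    def write(self, span: dict) -> None:
        # Analytical path: persist the structured fields.
        self._db.execute(
            "INSERT INTO spans VALUES (?, ?, ?, ?, ?)",
            (span["trace_id"], span["span_id"], span["tool"],
             span["latency_ms"], span["text"]),
        )
        # Full-text path: index each lowercased token of the span text.
        for token in span["text"].lower().split():
            self._index[token.strip(".,!?")].add(span["trace_id"])
        # Streaming path: push to live consumers with no batch delay.
        for callback in self._subscribers:
            callback(span)

    def sql(self, query: str, params: tuple = ()) -> list:
        """Run an analytical query over historical spans."""
        return self._db.execute(query, params).fetchall()

    def search(self, term: str) -> set[str]:
        """Return ids of traces whose text contains the term."""
        return set(self._index.get(term.lower(), set()))
```

A single write fans out to all three paths, which is the property to preserve when each path is backed by a separate engine. A real system would also need to handle spans of tens of megabytes, which this sketch does not attempt.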
How Do I Unify Observability and Evals Instead of Building Two Systems?
The Hetzel framework's Observability and Evals Are the Same Problem principle should change how you architect your platform. The scoring functions, trace storage, quality criteria, and analysis workflows are identical for both. The only difference: evals use known inputs run in batch, while observability processes unknown inputs in real time.
Build one unified trace and scoring infrastructure. When a problem surfaces in production observability, your workflow should make it trivial to: (a) add those traces to an offline dataset, (b) run evals against them in batch with different agent configurations, and (c) push validated fixes back to production. This is the iteration loop — and compressing it from weeks to hours is the highest-leverage thing you can do as a platform engineer.
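The loop above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the trace shape (a dict with id, input, context, and output), the toy groundedness scorer, and the observe and run_eval function names. The point it shows is that one scoring function serves both the online and the offline path.

```python
from typing import Callable

Trace = dict  # hypothetical shape: {"id", "input", "context", "output"}
Scorer = Callable[[Trace], float]
Agent = Callable[[str, str], str]  # (input, context) -> output


def groundedness(trace: Trace) -> float:
    """Toy scorer: fraction of output words that also appear in the context."""
    output_words = trace["output"].lower().split()
    context_words = set(trace["context"].lower().split())
    if not output_words:
        return 0.0
    return sum(w in context_words for w in output_words) / len(output_words)


def observe(trace: Trace, scorer: Scorer, threshold: float,
            dataset: list) -> float:
    """Online path: score a production trace; save it if it scores low."""
    score = scorer(trace)
    if score < threshold:
        dataset.append(trace)  # step (a): add to the offline dataset
    return score


def run_eval(dataset: list, agent: Agent, scorer: Scorer) -> float:
    """Offline path: replay saved inputs through a candidate agent (step b)."""
    scores = []
    for saved in dataset:
        output = agent(saved["input"], saved["context"])
        scores.append(scorer({**saved, "output": output}))
    return sum(scores) / len(scores) if scores else 0.0
```

Calling run_eval with several candidate agent configurations against the same dataset gives comparable scores, because the scorer is the same object used in production.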
How Do I Support Non-Technical Domain Experts on the Platform?
The Dual Persona Requirement means your platform must serve clinicians, lawyers, wealth advisors, or whoever the domain experts are — not just your engineering team. Build separate interfaces: engineers get dashboards with latency percentiles and error breakdowns; domain experts get conversation trace browsers with natural language search and annotation tools.
The annotation workflow is infrastructure, not a nice-to-have. Domain experts need to assign quality grades and write justifications for each grade. Those justifications are the raw material for building automated scoring functions that scale. Design your data model to store justifications as structured, queryable data from the start.
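One way to keep justifications structured and queryable is to store them as first-class columns next to the grade. The sketch below uses Python's built-in sqlite3 module; the table name, the column names, the 1-to-5 grade scale, and the helper functions are all assumptions chosen for illustration, not part of the framework.

```python
import sqlite3

# Hypothetical schema: one row per (trace, annotator, criterion) judgment.
SCHEMA = """
CREATE TABLE annotations (
    trace_id      TEXT    NOT NULL,
    annotator     TEXT    NOT NULL,
    criterion     TEXT    NOT NULL,
    grade         INTEGER NOT NULL CHECK (grade BETWEEN 1 AND 5),
    justification TEXT    NOT NULL,
    created_at    TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
)
"""


def open_store() -> sqlite3.Connection:
    """Create an in-memory annotation store (a real one would be persistent)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def annotate(conn: sqlite3.Connection, trace_id: str, annotator: str,
             criterion: str, grade: int, justification: str) -> None:
    """Record a domain expert's grade together with the reason for it."""
    conn.execute(
        "INSERT INTO annotations "
        "(trace_id, annotator, criterion, grade, justification) "
        "VALUES (?, ?, ?, ?, ?)",
        (trace_id, annotator, criterion, grade, justification),
    )


def low_grade_justifications(conn: sqlite3.Connection, criterion: str,
                             max_grade: int = 2) -> list:
    """Fetch the written reasons behind poor grades for one criterion."""
    rows = conn.execute(
        "SELECT trace_id, justification FROM annotations "
        "WHERE criterion = ? AND grade <= ? ORDER BY trace_id",
        (criterion, max_grade),
    )
    return rows.fetchall()
```

Querying the justifications behind low grades for a single criterion is the starting point for writing an automated scorer for that criterion.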
Next step: Audit your current agent system against the Hetzel framework's 9-step workflow. Start with Step 1 (classify determinism profile) and Step 3 (assess trace data characteristics) — these two steps alone will reveal whether your current infrastructure is adequate or needs purpose-built augmentation.
// FREQUENTLY ASKED QUESTIONS
What database should I use for agent traces?
You need a database that supports three read patterns simultaneously: real-time streaming via write-ahead log, SQL/analytical queries for historical analysis, and full-text search via a Tantivy-style index. Standard OLAP databases like ClickHouse are built for analytical queries; their full-text indexing and live-streaming reads are typically more limited than what a dedicated search index or log provides. Purpose-built agent trace databases or augmented architectures combining multiple engines are typically required.
Should I build agent observability infrastructure in-house or use a vendor?
Evaluate against the Hetzel framework's requirements: semi-structured trace ingestion at scale, three read patterns, human annotation workflows, automated scoring, and clustering for unknown unknowns. If a vendor covers all of these, use it. If you build in-house, the framework's 9-step workflow ensures you don't miss critical requirements like full-text search or the domain expert interface.
How do I handle the cost of storing massive agent traces?
Implement tiered storage: keep recent traces (days to weeks) in hot storage with full read pattern support, then archive older traces to cold storage with analytical query access only. Prioritize retaining traces flagged by automated scores or human annotators. The iteration loop requires fast access to problematic traces; routine traces can be sampled and archived.
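A tiering policy like this can be expressed as one small function. The sketch below is one possible reading of the advice, with assumptions labeled: the 14-day hot window, the 10% archive sample rate, the TraceMeta fields, and the choice to keep flagged traces in hot storage indefinitely are all illustrative values, not recommendations from the framework.

```python
import zlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=14)  # assumption: one reading of "days to weeks"
ARCHIVE_SAMPLE_RATE = 0.10       # assumption: share of routine traces archived


@dataclass(frozen=True)
class TraceMeta:
    trace_id: str
    created_at: datetime
    flagged: bool  # set by an automated score or a human annotator


def choose_tier(meta: TraceMeta, now: datetime,
                sample_rate: float = ARCHIVE_SAMPLE_RATE) -> str:
    """Return "hot", "cold", or "drop" for a trace."""
    # Problematic traces stay fast to reach for the iteration loop.
    if meta.flagged:
        return "hot"
    # Recent traces keep full read-pattern support.
    if now - meta.created_at <= HOT_WINDOW:
        return "hot"
    # Older routine traces: a deterministic sample, so the decision for a
    # given trace id is stable across runs and machines.
    bucket = (zlib.crc32(meta.trace_id.encode("utf-8")) % 10_000) / 10_000
    return "cold" if bucket < sample_rate else "drop"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    old_routine = TraceMeta("trace-123", now - timedelta(days=90), flagged=False)
    print(choose_tier(old_routine, now))
```

Hashing the trace id instead of calling a random number generator means a re-run of the archival job makes the same decision for the same trace.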