Durable Sessions vs Agent Observability: Which Do You Need?

// TL;DR

These two frameworks solve different problems and you likely need both. Start with Durable Sessions if users are experiencing broken streams, lost responses, or lack of multi-device continuity — it fixes the delivery layer. Start with Agent Observability if your AI agent is live but you cannot tell whether its outputs are actually good — it fixes the measurement layer. For teams launching a new AI product, build the Durable Sessions infrastructure first, then layer observability on top once traffic is flowing.

// HOW DO THEY COMPARE?

| Dimension | Christensen Durable Sessions AI UX Framework | Hetzel Agent Observability Differentiation Framework |
| --- | --- | --- |
| Best for | Fixing broken streaming UX: disconnections, lost responses, no multi-device support, no stop button | Understanding agent output quality: groundedness, tool usage, reasoning failures, unknown failure modes |
| Core problem solved | Agent-to-client delivery reliability and real-time control | Agent behavior measurement, quality scoring, and production diagnostics |
| Layer of the stack | Infrastructure / transport layer between agents and clients | Monitoring / evaluation layer on top of agent execution |
| Complexity to implement | High: requires replacing SSE with bidirectional transport, adding a pub/sub session layer, and rearchitecting agent output | Medium-to-high: requires trace ingestion infrastructure, human annotation workflows, and automated scoring pipelines |
| Time to apply | Days to weeks for the architectural audit; weeks to months for full implementation | Hours for the diagnostic audit; weeks for full annotation and scoring pipeline |
| Prerequisites | An existing AI chat or agent product with a streaming architecture (SSE, WebSockets, etc.) | An existing AI agent in production (or near-production) with observable trace data |
| Output type | Architecture redesign: session layer design, transport protocol selection, agent-client decoupling plan | Observability strategy: tooling recommendations, scoring functions, annotation workflows, persona mapping |
| Key stakeholders | Frontend engineers, backend/infra engineers, product designers | ML/AI engineers, domain experts (clinicians, lawyers, advisors), product managers |
| Creator background | Mike Christensen (Ably): real-time infrastructure and pub/sub messaging | Hetzel: agent observability tooling and evaluation systems |
| Relationship to existing tools | Replaces or augments Vercel AI SDK streaming, direct SSE, and raw WebSocket patterns | Complements Datadog/Grafana for technical metrics; replaces them for functional/qualitative agent metrics |

What does the Christensen Durable Sessions AI UX Framework do?

The Durable Sessions framework, created by Mike Christensen of Ably, diagnoses and fixes the delivery layer of AI chat and agent products. It identifies a core architectural flaw called the Single-Connection Trap: most AI products stream LLM responses over a direct HTTP connection (typically SSE), which means that if the user's connection drops mid-stream, the in-flight response is lost unless the server separately buffers and replays it. In the typical setup there is no resume, no multi-device visibility, and no reliable way for users to send control signals (like a stop button) back to the agent.

The framework introduces a Durable Session — a persistent, shared resource that sits between agents and clients. Agents write events to the session; clients subscribe to the session. Neither holds a direct pipe to the other. This architectural inversion unlocks three foundational capabilities: Resilient Delivery (streams survive disconnections), Continuity Across Surfaces (sessions follow users across tabs and devices), and Live Control (users can steer, interrupt, or cancel agents mid-generation).
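The write/subscribe split can be sketched in a few lines. This is a minimal in-memory illustration, not the framework's implementation: the names `DurableSession`, `Client`, `append`, `read_from`, and `sync` are hypothetical, and a real system would back the log with a pub/sub service rather than a Python list. The point it shows is that a client which tracks its own offset can resume after a drop, and a second device can replay the same session from the start.

```python
from dataclasses import dataclass, field


@dataclass
class DurableSession:
    """Append-only event log shared by agents (writers) and clients (readers)."""
    events: list[str] = field(default_factory=list)

    def append(self, event: str) -> int:
        """Agent side: record an event and return its offset in the log."""
        self.events.append(event)
        return len(self.events) - 1

    def read_from(self, offset: int) -> list[tuple[int, str]]:
        """Client side: return (offset, event) pairs at or after `offset`."""
        return list(enumerate(self.events[offset:], start=offset))


@dataclass
class Client:
    """A reader that remembers how far into the session it has read."""
    next_offset: int = 0
    received: list[str] = field(default_factory=list)

    def sync(self, session: DurableSession) -> None:
        for offset, event in session.read_from(self.next_offset):
            self.received.append(event)
            self.next_offset = offset + 1


session = DurableSession()
phone = Client()

# The agent streams two chunks and the phone receives them.
session.append("Hello")
session.append(", ")
phone.sync(session)
seen_before_drop = list(phone.received)

# The phone's connection drops; the agent keeps writing to the session.
session.append("world")
session.append("!")

# The phone reconnects and resumes from its last offset.
phone.sync(session)

# A second device joins late and replays the whole session.
laptop = Client()
laptop.sync(session)
```

Because the agent only ever writes to the session, it never needs to know whether any client is currently connected.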

The framework also solves the Orchestrator Dual-Purpose Problem in multi-agent systems, where the orchestrator is forced to relay sub-agent progress updates to clients. With Durable Sessions, every sub-agent writes directly to the shared session, eliminating the relay bottleneck.
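As a rough sketch of that idea, assume a shared session that any agent can publish to. The agent functions and the `SharedSession.publish` method below are hypothetical names invented for illustration. The orchestrator only coordinates the work; progress updates go from each sub-agent straight to the session.

```python
from dataclasses import dataclass, field


@dataclass
class SharedSession:
    """Event log that every agent writes to; clients would subscribe to it."""
    events: list[tuple[str, str]] = field(default_factory=list)

    def publish(self, source: str, message: str) -> None:
        self.events.append((source, message))


def research_agent(session: SharedSession, topic: str) -> list[str]:
    # Progress goes straight to the session, not back through the orchestrator.
    session.publish("research", f"searching for {topic}")
    sources = ["source-a", "source-b", "source-c"]  # placeholder results
    session.publish("research", f"found {len(sources)} sources")
    return sources


def writer_agent(session: SharedSession, sources: list[str]) -> str:
    session.publish("writer", f"drafting from {len(sources)} sources")
    return "draft"


def orchestrator(session: SharedSession, topic: str) -> str:
    # The orchestrator sequences the work but relays no progress updates.
    sources = research_agent(session, topic)
    draft = writer_agent(session, sources)
    session.publish("orchestrator", "done")
    return draft


session = SharedSession()
result = orchestrator(session, "pricing")
writers = [source for source, _ in session.events]
```

Tagging each event with its source is one possible design choice; it lets a client render per-agent progress without the orchestrator being involved.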

What does the Hetzel Agent Observability Differentiation Framework do?

The Hetzel framework helps teams diagnose whether their AI agent system requires traditional observability, agent-specific observability, or both — and then design the right observability stack. Its central insight is that traditional tools like Datadog and Grafana solve technical observability (latency, error rates, uptime) but completely miss functional observability: was the agent's response grounded in retrieved context? Did it use the right tools? Was it aligned to brand standards? Was the reasoning path sound?
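To make the technical/functional split concrete, here is a deliberately crude groundedness check based on token overlap. It is a stand-in chosen for brevity, not the scoring method the framework prescribes; production systems typically rely on LLM judges or expert-derived scorers. The function names, the stopword list, and the sample strings are all assumptions made for this sketch.

```python
import re

# Small illustrative stopword list; a real scorer would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "for", "on", "with"}


def content_tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def groundedness(response: str, context: str) -> float:
    """Fraction of the response's content tokens that also appear in the context."""
    tokens = content_tokens(response)
    if not tokens:
        return 0.0
    context_vocab = set(content_tokens(context))
    return sum(t in context_vocab for t in tokens) / len(tokens)


context = "The refund window is 30 days from delivery."
grounded_score = groundedness("Refund window is 30 days.", context)
ungrounded_score = groundedness("Refunds take 90 days and include shipping.", context)
```

Both responses could return in the same latency with no errors, so a technical dashboard would show them as identical; only a functional score separates them.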

The framework introduces the Dual Persona Requirement — effective agent observability must include domain experts (clinicians, lawyers, wealth advisors), not just engineers, because only domain experts can evaluate qualitative agent quality. It establishes a human annotation workflow where domain experts grade traces and write justifications, which then seed scalable automated scoring functions.
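One way to picture how annotations seed automated scoring: store each expert grade with its justification, then measure how often a candidate scoring function agrees with the experts before trusting it on production traffic. The `Annotation` record, the `cites_a_source` scorer, and the sample data below are invented for illustration and are not taken from the framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Annotation:
    trace_id: str
    response: str
    passed: bool        # the domain expert's grade
    justification: str  # why the expert graded it that way


def agreement(scorer: Callable[[str], bool], annotations: list[Annotation]) -> float:
    """Share of annotated traces where the scorer matches the expert grade."""
    if not annotations:
        raise ValueError("need at least one annotation")
    matches = sum(scorer(a.response) == a.passed for a in annotations)
    return matches / len(annotations)


def disagreements(scorer: Callable[[str], bool], annotations: list[Annotation]) -> list[Annotation]:
    """Traces to review when refining the scorer."""
    return [a for a in annotations if scorer(a.response) != a.passed]


def cites_a_source(response: str) -> bool:
    """Hypothetical candidate scorer: pass any response that cites a source."""
    return "[source:" in response.lower()


# Made-up sample annotations.
ANNOTATIONS = [
    Annotation("t1", "Refunds are accepted within 30 days [source: policy].", True, "Matches policy and cites it."),
    Annotation("t2", "Refunds are probably fine any time.", False, "Unsupported and uncited."),
    Annotation("t3", "Shipping is free over $50 [source: policy].", True, "Correct and cited."),
    Annotation("t4", "Refunds never expire [source: forum].", False, "Cited, but the claim is wrong."),
]

rate = agreement(cites_a_source, ANNOTATIONS)
to_review = disagreements(cites_a_source, ANNOTATIONS)
```

Here the scorer and the expert disagree on `t4`, and the expert's justification explains why: citing a source is not the same as being right. That is the kind of signal used to refine a scoring function.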

Critically, the framework treats observability and evals as the same underlying system — the only difference is batch vs. real-time and known vs. unknown inputs. It also addresses unknown unknowns through LLM-based clustering and topic modeling over production traces to surface failure modes that were never anticipated.
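The "same underlying system" claim can be sketched as one scoring path with two entry points. In this illustrative example the trace shape, the `answered` scorer, and the toy agent are all assumptions; the only point is that batch evals and live monitoring call the same `score_stream` function.

```python
from typing import Callable, Iterable, Iterator

Trace = dict[str, str]  # assumed shape: {"input": ..., "output": ...}
Scorer = Callable[[Trace], float]


def score_stream(traces: Iterable[Trace], scorer: Scorer) -> Iterator[tuple[Trace, float]]:
    """Shared core: score traces one at a time, whatever their origin."""
    for trace in traces:
        yield trace, scorer(trace)


def run_eval(dataset: list[str], agent: Callable[[str], str], scorer: Scorer) -> float:
    """Evals: known inputs, run in batch, summarized as a mean score."""
    traces = [{"input": question, "output": agent(question)} for question in dataset]
    scores = [score for _, score in score_stream(traces, scorer)]
    return sum(scores) / len(scores)


def monitor(live_traces: Iterable[Trace], scorer: Scorer, threshold: float) -> list[Trace]:
    """Observability: unknown inputs from production, flagged as they arrive."""
    return [trace for trace, score in score_stream(live_traces, scorer) if score < threshold]


def answered(trace: Trace) -> float:
    """Hypothetical scorer: 0.0 when the agent declined to answer, else 1.0."""
    return 0.0 if "i don't know" in trace["output"].lower() else 1.0


def toy_agent(question: str) -> str:
    return "Paris" if "France" in question else "I don't know"


eval_score = run_eval(["Capital of France?", "Capital of Atlantis?"], toy_agent, answered)
flagged = monitor(
    [{"input": "x", "output": "I don't know"}, {"input": "y", "output": "42"}],
    answered,
    threshold=0.5,
)
```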

How do they compare?

These frameworks operate at entirely different layers of the AI product stack and are complementary, not competitive.

Durable Sessions is an infrastructure framework. It rewires how agent outputs reach clients. It is not concerned with whether the agent's response was good, only with whether it was delivered reliably, visibly, and controllably. If your users complain about lost responses, broken stop buttons, or being unable to see conversations on a second device, this is your framework.

Agent Observability is a measurement framework. It is not concerned with how the response was delivered, only with whether it was correct, grounded, safe, and high-quality. If your agent is live but you have no idea whether it is producing good outputs, whether users are satisfied, or what failure modes exist in production, this is your framework.

The overlap is minimal. Durable Sessions might produce trace-like event streams that an observability system could consume, but the framework itself does not address quality measurement. Agent Observability might flag delivery failures as technical metrics, but it does not solve the underlying transport architecture.

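One important distinction: Durable Sessions is primarily consumed by infrastructure and frontend engineers, while Agent Observability explicitly requires non-technical domain experts as first-class participants. The organizational change therefore differs: the first is an engineering project, while the second means bringing domain experts into a recurring trace-review and annotation loop.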

Which should you choose?

If you are pre-launch or early-launch and your AI product's streaming UX breaks when users switch networks, open a second tab, or hit the stop button, start with Durable Sessions. You cannot meaningfully observe agent quality if the delivery layer is so fragile that users never see complete responses. Fix the pipe first.

If your AI product is live in production with stable delivery but you cannot answer the question "is our agent actually producing good outputs?", start with Agent Observability. You need functional scoring, human annotation workflows, and unknown-unknown discovery before you can systematically improve agent quality.

For most teams building serious AI products, the answer is both, sequentially. Build Durable Sessions infrastructure during the product engineering phase to ensure reliable, controllable, multi-surface delivery. Then layer Agent Observability on top once you have real production traffic to measure.

If you have a multi-agent architecture with an orchestrator bottleneck, Durable Sessions is clearly the higher-priority framework — the Orchestrator Dual-Purpose Problem directly degrades both UX and architecture. If you have a single-agent system with stable delivery but unknown quality, Agent Observability is clearly the priority.

Do not try to substitute one for the other. A perfectly observable agent with broken delivery is useless to users. A perfectly delivered agent with no observability will silently degrade in quality with no mechanism for detection or correction.

// FREQUENTLY ASKED QUESTIONS

Can I use Durable Sessions and Agent Observability together?

Yes, and you should. They solve different problems at different layers. Durable Sessions fixes the delivery infrastructure between agents and clients. Agent Observability measures whether the agent's outputs are actually good. Most production AI products need both. Build Durable Sessions first for reliable delivery, then layer observability on top once you have stable production traffic to analyze.

Do I need Durable Sessions if I'm using the Vercel AI SDK?

Likely yes, if you rely on its default streaming. By default the Vercel AI SDK streams each response to the client over a single HTTP connection (typically SSE), which creates the Single-Connection Trap: if the user's connection drops, the in-flight response is lost. SSE is also one-way, so it cannot support a true stop button without ambiguity: closing the connection could mean "cancel" or "I disconnected." If your product needs resilient delivery, multi-device continuity, or live control, you need to augment or replace the default SSE streaming.
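The cancel-versus-disconnect ambiguity goes away once "stop" is an explicit message rather than a closed socket. The sketch below assumes a session that carries control signals alongside output. `Session`, `Control`, and `generate` are hypothetical names, and the `on_step` hook exists only to simulate client behavior in this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Control(Enum):
    CANCEL = "cancel"


@dataclass
class Session:
    output: list[str] = field(default_factory=list)
    controls: list[Control] = field(default_factory=list)
    connected_clients: int = 0


def generate(
    session: Session,
    tokens: list[str],
    on_step: Optional[Callable[[int, Session], None]] = None,
) -> str:
    """Write tokens to the session, stopping only on an explicit CANCEL."""
    for i, token in enumerate(tokens):
        if Control.CANCEL in session.controls:
            return "cancelled"
        session.output.append(token)
        if on_step is not None:
            on_step(i, session)  # simulation hook: lets a "client" act mid-stream
    return "completed"


def drop_connection(step: int, session: Session) -> None:
    if step == 1:
        session.connected_clients = 0  # the client vanished without saying why


def press_stop(step: int, session: Session) -> None:
    if step == 1:
        session.controls.append(Control.CANCEL)  # the user explicitly asked to stop


tokens = ["a", "b", "c", "d"]

dropped = Session(connected_clients=1)
dropped_status = generate(dropped, tokens, drop_connection)

stopped = Session(connected_clients=1)
stopped_status = generate(stopped, tokens, press_stop)
```

In the first run the agent finishes and the full output waits in the session for the client to return; in the second it stops because it was told to, not because a connection happened to close.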

Is Datadog or Grafana enough to monitor an AI agent in production?

No. Datadog and Grafana handle technical observability — latency, error rates, uptime — but cannot evaluate functional quality like groundedness, correct tool usage, or reasoning accuracy. The Hetzel framework makes clear that agent observability requires purpose-built tooling for the functional layer. Keep Datadog or Grafana for technical metrics, but add agent-specific observability for quality measurement.

Which framework should I apply first when building a new AI product?

Start with Durable Sessions. If your delivery layer is broken — users lose responses on disconnect, cannot see conversations across devices, or have an unreliable stop button — you cannot meaningfully measure agent quality because users never receive complete outputs. Fix the infrastructure first, then layer Agent Observability on top once production traffic is flowing.

Do I need domain experts for either of these frameworks?

Only for Agent Observability. The Hetzel framework's Dual Persona Requirement explicitly mandates that domain experts — clinicians, lawyers, wealth advisors — participate in trace review and annotation. Their qualitative judgments seed automated scoring functions. Durable Sessions is primarily an engineering framework consumed by infrastructure and frontend engineers.

Does Agent Observability replace the need for evals?

No, it unifies them. The Hetzel framework treats observability and evals as the same underlying system. Evals use known inputs in batch; observability processes unknown inputs in real time. Both use the same trace infrastructure, scoring functions, and annotation workflows, so building one well gives you most of what the other needs.

What if I have a multi-agent system with an orchestrator?

Apply both frameworks. Durable Sessions solves the Orchestrator Dual-Purpose Problem by letting sub-agents write directly to a shared session, eliminating the orchestrator's relay bottleneck. Agent Observability then measures the quality of each sub-agent's outputs independently. Together they give you both reliable multi-agent delivery and per-agent quality visibility.

How long does it take to implement each framework?

Durable Sessions requires a significant architectural change — replacing SSE with bidirectional transport and adding a pub/sub session layer — typically taking weeks to months for full implementation. Agent Observability's diagnostic audit can be done in hours, but building the full stack (trace infrastructure, annotation workflows, automated scoring) takes weeks. Both are ongoing investments, not one-time projects.