Durable Sessions vs Eval Maturity: Which Framework Do You Need?

// TL;DR

These frameworks solve completely different problems and are not substitutes for each other. Use Christensen's Durable Sessions Framework when your AI product's streaming, connectivity, or multi-device experience is breaking under real-world conditions. Use Hetzel's Eval Maturity Phases Framework when you need to systematically measure and improve your LLM or agent's output quality before or during production. Most teams building production AI products will need both — but if you're stuck choosing where to invest first, start with evals (Hetzel) because you cannot fix UX delivery for an agent that gives bad answers.

// HOW DO THEY COMPARE?

DimensionChristensen Durable Sessions AI UX FrameworkHetzel Eval Maturity Phases Framework
Best forFixing broken streaming, disconnections, multi-device sync, and agent control in AI chat productsBuilding and maturing a structured evaluation system for LLM/agent output quality
Problem domainAI UX infrastructure and real-time delivery architectureAI quality assurance, scoring, and continuous improvement
ComplexityHigh — requires re-architecting streaming layer, introducing pub/sub, and changing transport protocolsIncremental — four maturity phases let you start with vibe checks and scale gradually
Time to applyWeeks to months — infrastructure-level changes to streaming and session managementHours to days for initial phases; weeks for full flywheel and advanced techniques
PrerequisitesExisting AI chat or agent product with a streaming architecture (SSE, WebSockets, etc.)Any LLM-powered agent or prompt; ideally some production or UAT traces
Output typeArchitectural redesign: a Durable Sessions layer with pub/sub channels between agents and clientsEval system: datasets, scoring functions, LLM-as-judge configurations, and a production flywheel
Creator backgroundMike Christensen, Ably — real-time infrastructure and messaging platform expertisePhil Hetzel, Braintrust — LLM evaluation platform and AI quality tooling expertise
Team skillset requiredBackend/infrastructure engineers comfortable with pub/sub, WebSockets, and session state managementAny AI/ML engineer or product team; domain SMEs needed for annotation phases
When to skipSkip if your product is single-device, tolerates dropped streams, and needs no live agent controlSkip only if your agent is purely experimental with no path to production (but even then, vibe checking helps)
Scaling modelInfrastructure scales via pub/sub channels and session persistence — one-time architectural investmentEval effort scales through automation: LLM-as-judge, topic modelling, and CI-integrated eval pipelines

What does the Christensen Durable Sessions AI UX Framework do?

Mike Christensen's framework diagnoses why AI chat and agent products break under real-world conditions — network drops, multi-device usage, and users wanting to interrupt or steer agents mid-generation. The core insight is that most AI products use direct HTTP streaming (typically SSE), which couples the health of the response stream to a single client connection. When that connection drops, the stream is gone.

The framework introduces Durable Sessions: a persistent, shared layer between agents and clients built on pub/sub channels. Agents write events to the session; clients subscribe to the session. This architectural inversion unlocks three foundational capabilities: Resilient Delivery (streams survive disconnections), Continuity Across Surfaces (sessions follow users across tabs and devices), and Live Control (clients can steer, interrupt, or cancel agents mid-generation).

The framework also addresses the SSE Resume-Cancel Conflict — where closing an SSE connection is ambiguous between "I disconnected" and "I pressed stop" — by requiring bidirectional transport like WebSockets. For multi-agent architectures, it eliminates the Orchestrator Dual-Purpose Problem by having sub-agents write directly to the shared session rather than relaying through a central orchestrator.

What does the Hetzel Eval Maturity Phases Framework do?

Phil Hetzel's framework provides a structured, stage-by-stage methodology for building and maturing an LLM evaluation system. It acknowledges that most teams are stuck doing informal "vibe checking" and gives them a concrete ladder to climb toward production-grade eval systems.

The four maturity phases are: (1) Just Getting Started — structured vibe checking with documented human annotation, (2) Measuring to Manage — deriving failure modes from annotations and building scoring functions (deterministic and LLM-as-judge), (3) Accounting for Complexity — handling tool calls, CRUD operations on external systems, and multi-step trace evaluation, and (4) Advanced Techniques — topic modelling across production traces and fully automated eval pipelines.

A critical principle is "eval the eval" — never trust LLM-as-judge outputs without validating them against human ground truth. The framework culminates in The Flywheel: a continuous loop where production traces surface failures, those failures become eval dataset entries, evals guide improvements, and improvements are measured in the next cycle.

How do Durable Sessions and Eval Maturity Phases compare?

These frameworks operate in entirely different layers of the AI product stack and are complementary, not competing.

Durable Sessions lives in the infrastructure and delivery layer. It answers: "How do I get the agent's output to the user reliably, across devices, with live interactivity?" It does not care whether the agent's answers are good — only that they arrive intact.

Eval Maturity Phases lives in the quality and correctness layer. It answers: "How do I know the agent's outputs are good enough for production, and how do I keep improving them?" It does not care how those outputs are delivered to the client.

Durable Sessions is a one-time architectural investment with high upfront complexity. Eval Maturity is an incremental, ongoing process that starts simple and scales. Durable Sessions requires backend infrastructure expertise; Eval Maturity requires domain knowledge and subject matter experts.

The only overlap is that both frameworks aim to bridge the gap between a fragile demo and a production-quality AI product — but they address completely different dimensions of that gap.

Which should you choose?

If your agent gives wrong answers, start with Hetzel's Eval Maturity Phases. No amount of resilient streaming helps if the content being streamed is incorrect, unsafe, or unreliable. Build your eval system, identify failure modes, close the flywheel loop, and get your agent's quality to a defensible level.

If your agent's answers are good but users complain about dropped responses, can't switch devices, or can't stop a runaway generation, apply Christensen's Durable Sessions Framework. Your problem is delivery infrastructure, not agent quality.

If you're building a serious production AI product, you need both. Eval Maturity ensures the agent is worth deploying; Durable Sessions ensures the deployment actually works for real users. Start with evals (the quality foundation), then layer on Durable Sessions as you scale to real-world usage patterns.

For teams at the proof-of-concept stage, Hetzel's framework is the clear first investment — you can start today with 10 examples and a subject matter expert. Christensen's framework becomes critical once you have real users on unreliable networks, multiple devices, or workflows that demand live agent control.

Can you use both frameworks together?

Absolutely — and you should. The ideal workflow is to use Eval Maturity Phases to ensure your agent produces high-quality outputs, then use Durable Sessions to ensure those outputs reach users reliably across every surface. The Durable Sessions architecture actually makes Hetzel's flywheel easier to implement: because all agent events flow through a persistent session layer, capturing production traces for your eval dataset becomes a natural byproduct of the infrastructure rather than a separate instrumentation effort. Teams that adopt both frameworks build AI products that are correct and resilient — the two prerequisites for user trust.

// FREQUENTLY ASKED QUESTIONS

Do I need Durable Sessions or Eval Maturity Phases for my AI chatbot?

It depends on your primary problem. If users report dropped responses, can't switch devices, or can't stop generation, use Durable Sessions. If your agent gives wrong, hallucinated, or inconsistent answers, use Eval Maturity Phases. Most production chatbots eventually need both frameworks applied in sequence — evals first, then delivery infrastructure.

Can I use Christensen's Durable Sessions framework without changing my SSE streaming setup?

Partially. You can add resilient delivery and cross-surface continuity with a session layer on top of SSE. However, if you need live control — stop buttons, steering messages, or mid-generation interrupts — you must replace SSE with a bidirectional transport like WebSockets because SSE cannot distinguish between a disconnect and a user-initiated cancel.

What is the fastest way to start evaluating my LLM agent?

Use Hetzel's Phase 1: pick 10–20 representative inputs, have a subject matter expert review each output with a thumbs up/down plus a written justification explaining why. This structured vibe check takes hours, not weeks, and the justifications become the raw material for building automated scoring functions in later phases.

What is LLM-as-judge and can I trust it?

LLM-as-judge uses a separate LLM to score your agent's outputs. You should not trust it blindly. Hetzel's framework requires you to 'eval the eval' — validate judge outputs against a human-labelled ground truth dataset. Only once the judge demonstrates acceptable alignment with human expert decisions should you use it to score at scale.

What is a Durable Session in AI product architecture?

A Durable Session is a persistent, shared resource sitting between agents and clients. Agents write events to it; clients subscribe to it. Messages outlive any individual connection, device, or agent instance. It is typically implemented using pub/sub channels and enables resilient delivery, multi-device continuity, and live agent control simultaneously.

Which framework should I prioritize if I'm building a multi-agent system?

Start with Eval Maturity Phases to ensure each agent and the orchestrator produce quality outputs — multi-agent systems multiply failure modes. Then apply Durable Sessions to solve the Orchestrator Dual-Purpose Problem, where sub-agents write directly to the session channel instead of relaying through the orchestrator, simplifying architecture and enabling real-time multi-agent visibility for users.

How long does it take to implement Durable Sessions vs an eval system?

Durable Sessions is an infrastructure-level change requiring weeks to months — you're re-architecting your streaming layer and introducing a pub/sub session substrate. An eval system can start producing value in hours (Phase 1 vibe checking) and scales incrementally over weeks. Evals offer a much faster time-to-first-value.

Are Durable Sessions and Eval Maturity competing frameworks?

No. They solve completely different problems in different layers of the stack. Durable Sessions fixes how agent outputs are delivered to users. Eval Maturity fixes the quality of those outputs. They are complementary — production AI products benefit from applying both, typically starting with evals to ensure quality, then adding Durable Sessions for reliable delivery.