Durable Sessions vs Planner-Generator-Evaluator: Which?

// TL;DR

These frameworks solve different problems and rarely compete. If your AI product's streaming architecture breaks on disconnects, fails across multiple devices, or lacks a reliable stop button, use the Christensen Durable Sessions Framework. If you're building autonomous agents that run for hours and produce complex artifacts like full applications, use Anthropic's Planner-Generator-Evaluator (PGE) Framework. Most teams shipping a user-facing AI chat product should start with Durable Sessions, because it fixes the gap users feel first. Teams building autonomous code-generation or creative pipelines need PGE.

// HOW DO THEY COMPARE?

| Dimension | Christensen Durable Sessions AI UX Framework | Anthropic Planner-Generator-Evaluator Long-Agent Framework |
| --- | --- | --- |
| Best for | Fixing broken AI chat/streaming UX: disconnects, multi-device sync, live agent control | Building autonomous agents that run for hours producing complex artifacts (apps, codebases) |
| Problem domain | Real-time delivery and connectivity infrastructure between agents and clients | Agent orchestration, long-running task coherence, and output quality assurance |
| Complexity to implement | Medium: requires pub/sub infrastructure and transport layer changes (e.g. SSE → WebSockets) | High: requires building a multi-agent harness with separate context windows, file-based contracts, live verification tools, and ongoing prompt tuning |
| Time to apply | Days to weeks: architectural audit + session layer integration | Weeks to months: harness design, rubric calibration, trace reading, model-specific tuning |
| Prerequisites | An existing AI product with streaming responses and identifiable UX failures (dropped streams, no multi-device) | A defined artifact to build, a quality rubric, access to capable models (Opus-class for planning), and verification tooling (Playwright, computer use) |
| Output type | A resilient, multi-surface, controllable real-time session infrastructure layer | A production-grade artifact (app, codebase, pipeline) built autonomously over hours |
| Creator background | Mike Christensen (Ably): real-time infrastructure and pub/sub expert | Ash Prabaker & Andrew Wilson (Anthropic): AI agent research and long-running agent systems |
| Multi-agent handling | Solves the delivery side: all agents write to one Durable Session, clients see everything | Solves the orchestration side: Planner decomposes work, Generator builds, Evaluator critiques adversarially |
| Ongoing maintenance | Low once session layer is in place; infrastructure is model-agnostic | High: harness must be re-tuned with every major model release as spiky behaviours shift |
| Key risk if ignored | Users experience broken streams, can't use stop buttons, can't switch devices; product feels fragile | Agents drift, rubber-stamp bad output, lose coherence after 30+ minutes; artifacts are unusable |

What does the Christensen Durable Sessions Framework do?

The Christensen Durable Sessions Framework diagnoses and fixes the infrastructure layer between your AI agents and your users. It identifies a core failure it calls the Single-Connection Trap: most AI products stream responses over a direct HTTP connection (typically SSE), so if the user's connection drops, the response is gone. A second device or tab can't see the live response. The stop button is ambiguous — closing an SSE connection could mean "I disconnected" or "please cancel."

The framework solves this by introducing a Durable Session — a persistent, shared pub/sub channel that sits between agents and clients. Agents write events to the session; clients subscribe to the session. Neither holds a direct pipe to the other. This single architectural inversion unlocks three foundational capabilities: Resilient Delivery (streams survive disconnects), Continuity Across Surfaces (sessions follow users across tabs and devices), and Live Control (clients can steer, interrupt, or cancel agents mid-generation via bidirectional transport).
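
To make the inversion concrete, here is a minimal in-memory sketch of a session log that a client can resume from. It is an illustration only, not the framework's implementation: the names `DurableSession`, `publish`, and `read_after` are hypothetical, and a real deployment would put a persistent pub/sub service where the Python list is.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Event:
    seq: int       # position in the session log, starting at 1
    payload: Any


@dataclass
class DurableSession:
    """Hypothetical in-memory stand-in for a persistent pub/sub channel."""
    events: List[Event] = field(default_factory=list)

    def publish(self, payload: Any) -> int:
        # Agents write to the session, never directly to a client connection.
        event = Event(seq=len(self.events) + 1, payload=payload)
        self.events.append(event)
        return event.seq

    def read_after(self, last_seq: int = 0) -> List[Event]:
        # Clients resume by naming the last sequence number they saw.
        return [e for e in self.events if e.seq > last_seq]


session = DurableSession()
for token in ["Hello", ",", " world"]:
    session.publish(token)

last_seen = 1          # the client saw "Hello", then its connection dropped
session.publish("!")   # the agent keeps writing while the client is offline

missed = session.read_after(last_seen)
resumed_text = "".join(e.payload for e in missed)
```

Because the agent only ever appends to the log, a dropped client connection loses nothing: on reconnect the client asks for everything after `last_seen` and picks up where it left off.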

This framework is entirely about the delivery and connectivity layer. It does not address what the agent produces — only how that output reaches the user reliably.

What does the Anthropic Planner-Generator-Evaluator Framework do?

The Anthropic Planner-Generator-Evaluator (PGE) Framework is a harness design pattern for autonomous agents that run for hours building complex artifacts — full applications, codebases, or data pipelines. Its core insight is that models cannot reliably judge their own output: a builder agent reviewing its own work will call a half-baked feature done.

PGE splits the work into three roles with separate context windows: a Planner that produces a high-level spec organized into sprints; a Generator that builds one feature at a time; and an Evaluator that adversarially critiques the output using live verification tools like Playwright. Before any code is written, the Generator and Evaluator negotiate a contract, a file-based agreement defining exactly what "done" means, so grading is precise rather than vague.
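
The contract-then-build loop can be sketched roughly as follows. This is a toy under stated assumptions, not Anthropic's harness: `generator` and `evaluator` are hypothetical stand-ins for separate model calls, the contract file layout is invented for illustration, and the evaluator here does a simple set check where a real one would drive a tool such as Playwright.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Set

# Illustrative acceptance criteria; a real contract would be negotiated by the agents.
CRITERIA = ["renders form", "rejects empty password"]


def write_contract(path: Path, criteria: List[str]) -> None:
    # The contract lives on disk so both roles read the same definition of "done".
    path.write_text(json.dumps({"feature": "login form", "done_when": criteria}, indent=2))


def generator(attempt: int) -> Set[str]:
    # Hypothetical stand-in for the Generator: the first attempt is half-finished.
    if attempt == 1:
        return {"renders form"}
    return {"renders form", "rejects empty password"}


def evaluator(contract_path: Path, built: Set[str]) -> List[str]:
    # Hypothetical stand-in for the Evaluator: returns the unmet criteria.
    contract = json.loads(contract_path.read_text())
    return [c for c in contract["done_when"] if c not in built]


def run_sprint(workdir: Path, max_attempts: int = 3) -> int:
    contract_path = workdir / "contract.json"
    write_contract(contract_path, CRITERIA)
    failures: List[str] = []
    for attempt in range(1, max_attempts + 1):
        built = generator(attempt)
        failures = evaluator(contract_path, built)
        if not failures:
            return attempt  # the Evaluator, not the Generator, decides "done"
    raise RuntimeError(f"contract not met after {max_attempts} attempts: {failures}")


with tempfile.TemporaryDirectory() as tmp:
    attempts_needed = run_sprint(Path(tmp))
```

The point of the sketch is the separation of duties: the Generator never gets to declare its own work finished, and the grading criteria are fixed in a file before building starts.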

The framework also addresses context rot (coherence degrades deep into a session) and context anxiety (models rush to finish near the context limit), using file-system-based state and structured hand-offs to manage long time horizons. Critically, the harness is designed to evolve: as models improve, scaffold components that compensated for earlier model weaknesses should be identified and removed.
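
One way to picture file-system-based state with a structured hand-off is sketched below. The file name `handoff.json`, its fields, and the helper names are assumptions made for illustration; the source does not specify a format.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def write_handoff(path: Path, completed: List[str], next_step: str, notes: str) -> None:
    # The outgoing session records what it finished and what comes next.
    state = {"completed": completed, "next_step": next_step, "notes": notes}
    path.write_text(json.dumps(state, indent=2))


def build_fresh_context(path: Path) -> str:
    # A new session starts with an empty context window and reads only this summary.
    state = json.loads(path.read_text())
    lines = ["Completed so far:"]
    lines += [f"- {item}" for item in state["completed"]]
    lines.append(f"Next step: {state['next_step']}")
    lines.append(f"Notes: {state['notes']}")
    return "\n".join(lines)


with tempfile.TemporaryDirectory() as tmp:
    handoff = Path(tmp) / "handoff.json"
    write_handoff(
        handoff,
        completed=["auth flow", "dashboard layout"],
        next_step="build settings page",
        notes="API client lives in src/api.py",
    )
    preamble = build_fresh_context(handoff)
```

Keeping state on disk rather than in the conversation means each session can start from a short, curated summary instead of a long history in which coherence has already begun to degrade.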

How do they compare?

These two frameworks operate at completely different layers of the AI product stack, and understanding this is the most important takeaway.

Durable Sessions operates at the delivery layer. It answers: "How do agent outputs reach users reliably across connections, devices, and control interactions?" It is model-agnostic, infrastructure-focused, and relevant to any product with a real-time AI chat or streaming interface.

PGE operates at the orchestration and quality layer. It answers: "How do we structure multiple agent roles to produce high-quality artifacts over multi-hour autonomous runs?" It is deeply model-aware, prompt-engineering-intensive, and relevant to products where agents build things autonomously.

They are complementary, not competitive. A system using PGE to generate a full-stack application over four hours could use Durable Sessions to stream all three agents' live progress to a user's dashboard across their laptop and phone. Durable Sessions would solve the delivery; PGE would solve the coherence and quality.

That said, there is one area of overlap: multi-agent architectures. Both frameworks address multi-agent systems, but from opposite angles. Durable Sessions solves the Orchestrator Dual-Purpose Problem — preventing the orchestrator from having to relay sub-agent updates to clients. PGE solves the self-evaluation trap — preventing a single agent from both building and judging. If you have multiple agents, you likely need both patterns.

Which should you choose?

Choose Durable Sessions if your users are experiencing a broken real-time AI experience: dropped streams, no multi-device continuity, an unreliable stop button, or sub-agent activity invisible to the client. This is the more common problem for teams shipping user-facing AI products today. It is faster to implement, lower maintenance, and immediately improves perceived product quality.

Choose PGE if you are building autonomous agent systems that run for extended periods — 30 minutes or more — and you need production-grade output quality. This is the right choice for AI-powered code generation, complex creative workflows, or any pipeline where the agent must maintain coherence and quality over a long time horizon. Expect significant upfront investment in rubric design, prompt tuning, and ongoing harness maintenance.

Choose both if you are building a product where long-running autonomous agents produce artifacts AND users need real-time visibility, multi-device access, and live control over those agents. The Durable Sessions layer handles delivery; PGE handles orchestration and quality. They do not conflict.

If you are unsure where to start, start with Durable Sessions. The gap between a fragile demo and a great AI product usually shows up in the delivery infrastructure before it shows up in the agent logic. Once your delivery layer is solid, you can layer PGE on top for complex autonomous workflows without reworking how output reaches users.

// FREQUENTLY ASKED QUESTIONS

Can I use Durable Sessions and Planner-Generator-Evaluator together?

Yes, and you should if you have long-running agents with user-facing streaming UIs. Durable Sessions handles the delivery layer — ensuring streams survive disconnects and work across devices. PGE handles the orchestration layer — ensuring agents maintain quality and coherence over hours. They operate at different layers and complement each other naturally.

Which framework fixes my AI chat app's broken stop button?

The Christensen Durable Sessions Framework. The stop button problem stems from SSE's Resume-Cancel Conflict: closing an SSE connection is ambiguous between a network disconnect and a user cancel. Durable Sessions replaces SSE with bidirectional transport (WebSockets) and explicit cancel signals through the session channel, resolving the ambiguity completely.
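
The difference between the two signals can be sketched as below. This is a simplified illustration, not the framework's protocol: the `stream` function and the signal strings `"cancel"` and `"disconnect"` are hypothetical.

```python
from typing import Dict, List


def stream(tokens: List[str], signals: Dict[int, str]) -> List[str]:
    """Write tokens to the session; `signals` maps a step index to a client signal."""
    written: List[str] = []
    for step, token in enumerate(tokens):
        signal = signals.get(step)
        if signal == "cancel":
            # An explicit cancel message is the only thing that stops generation.
            break
        # A "disconnect" (or no signal) leaves the agent writing to the session.
        written.append(token)
    return written


tokens = ["The", " answer", " is", " 42"]
after_disconnect = stream(tokens, {1: "disconnect"})
after_cancel = stream(tokens, {1: "cancel"})
```

With a plain SSE connection both cases look like the same closed socket. Once cancel is its own message, a network drop no longer has to be interpreted as user intent.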

Why can't an AI agent just review its own code output?

The Anthropic PGE Framework explains this as the self-evaluation trap. Models exhibit sycophancy and generosity bias when judging their own work — the same way LLM-as-judge systems are biased. A builder will call a half-implemented feature done. Using a separate Evaluator agent in its own context window with an adversarially-tuned prompt produces genuinely critical, actionable feedback.

Do I need Durable Sessions if I'm already using WebSockets?

Likely yes. WebSockets give you bidirectionality, which solves the live control problem, but they don't automatically solve multi-device visibility or stream resumability. A Durable Sessions layer — a persistent, shared pub/sub channel — is still required so that multiple clients can subscribe to the same session and reconnect without data loss.
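
A rough sketch of what the shared channel adds on top of a bidirectional socket: several clients attach to one session, and a late or returning client replays what it missed. The class `SharedSession` and its methods are hypothetical names, and the in-memory list stands in for persistent storage.

```python
from typing import Callable, Dict, List, Tuple

Callback = Callable[[int, str], None]


class SharedSession:
    """Hypothetical in-memory session shared by several clients."""

    def __init__(self) -> None:
        self._log: List[Tuple[int, str]] = []
        self._subscribers: Dict[str, Callback] = {}

    def subscribe(self, client_id: str, on_event: Callback, after_seq: int = 0) -> None:
        # Replay everything after the client's last seen sequence number, then go live.
        for seq, payload in self._log[after_seq:]:
            on_event(seq, payload)
        self._subscribers[client_id] = on_event

    def unsubscribe(self, client_id: str) -> None:
        self._subscribers.pop(client_id, None)

    def publish(self, payload: str) -> None:
        seq = len(self._log) + 1
        self._log.append((seq, payload))
        for callback in list(self._subscribers.values()):
            callback(seq, payload)


session = SharedSession()
laptop: List[str] = []
phone: List[str] = []

session.subscribe("laptop", lambda seq, p: laptop.append(p))
session.publish("a")
session.publish("b")
session.unsubscribe("laptop")                                  # laptop tab closes
session.subscribe("phone", lambda seq, p: phone.append(p))     # phone joins late, replays a and b
session.publish("c")
session.subscribe("laptop", lambda seq, p: laptop.append(p), after_seq=2)  # laptop returns
```

Both devices end up with the same three events even though neither was connected for the whole run, which a single point-to-point WebSocket cannot provide by itself.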

How long does it take to implement the Planner-Generator-Evaluator harness?

Expect weeks to months. The harness itself can be prototyped in days, but the real work is calibrating the Evaluator: writing granular rubrics, providing few-shot examples of good and bad output, reading agent transcripts by hand to find judgment gaps, and tuning prompts. This calibration must be repeated after every major model release as failure modes shift.
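
One simple way to look for judgment gaps is to compare the Evaluator's verdicts against hand labels on a small example set. The sketch below assumes such labels exist; the examples and field names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class LabeledExample:
    name: str
    human_pass: bool       # verdict from reading the output by hand
    evaluator_pass: bool   # verdict the Evaluator agent produced


def calibration_report(examples: List[LabeledExample]) -> Dict[str, Union[float, List[str]]]:
    # False passes are the rubber-stamping cases; false fails are over-strict ones.
    false_passes = [e.name for e in examples if e.evaluator_pass and not e.human_pass]
    false_fails = [e.name for e in examples if e.human_pass and not e.evaluator_pass]
    agreement = 1 - (len(false_passes) + len(false_fails)) / len(examples)
    return {"agreement": agreement, "false_passes": false_passes, "false_fails": false_fails}


# Illustrative data only, not real results.
examples = [
    LabeledExample("complete checkout flow", human_pass=True, evaluator_pass=True),
    LabeledExample("button with no click handler", human_pass=False, evaluator_pass=True),
    LabeledExample("form missing validation", human_pass=False, evaluator_pass=False),
    LabeledExample("working search page", human_pass=True, evaluator_pass=True),
]
report = calibration_report(examples)
```

Each entry in `false_passes` points at a rubric item or few-shot example that needs tightening, and rerunning the same report after a model upgrade shows whether the Evaluator's failure modes have shifted.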

What's the difference between context rot and context anxiety in long-running agents?

Context rot is the gradual degradation of coherence as an agent works deeper into its context window — output drifts and becomes inconsistent. Context anxiety is a distinct behaviour where, approaching the context limit, the model rushes to finish prematurely and incompletely. PGE addresses both through fresh context sessions, file-system-based state, and structured hand-offs.

Which framework should I use for a multi-agent AI product?

It depends on what's breaking. If sub-agent progress isn't reaching users, agents can't be interrupted, or the orchestrator is bottlenecked relaying updates — use Durable Sessions. If agents are producing low-quality output, losing coherence, or rubber-stamping their own work — use PGE. For complex multi-agent products with both delivery and quality problems, use both frameworks at their respective layers.

Does the Durable Sessions framework require a specific technology like Ably?

No. The framework defines architectural principles: persistent, independently addressable, resumable pub/sub channels between agents and clients. Ably is one implementation substrate (and the company of the framework's creator, Mike Christensen), but any pub/sub system that supports persistence, sequence-based resumability, and bidirectional channels can serve as the foundation. The pattern is technology-agnostic.