Harness Engineering vs Durable Sessions: Which Should You Use?

// TL;DR

These frameworks solve different problems and are complementary, not competing. Use Harness Engineering when your AI agents are unreliable: skipping steps, hallucinating completion, or producing unverifiable output. Use Durable Sessions when your AI product's streaming UX breaks under real-world conditions like disconnections, multi-device usage, or lack of user control. Most teams building production agent systems will eventually need both: Harness Engineering for agent correctness, Durable Sessions for delivery resilience.

// HOW DO THEY COMPARE?

| Dimension | Nick Nisi: Harness Engineering for AI Agents | Christensen: Durable Sessions AI UX Framework |
| --- | --- | --- |
| Best for | Ensuring AI agents complete tasks correctly and provably | Ensuring AI-generated responses reach users reliably across surfaces |
| Core problem solved | Agent unreliability: hallucinated completion, skipped steps, unverified output | Delivery fragility: dropped streams, no multi-device sync, no live control |
| Architecture layer | Agent orchestration and execution pipeline (backend) | Streaming delivery and client connectivity (infrastructure/transport) |
| Complexity | High: requires state machine, eval suite, retrospective agent, evidence artifacts | Medium: requires pub/sub session layer, transport swap from SSE to WebSockets |
| Time to apply | Days to weeks for full harness; hours for targeted gotcha files | Hours to days for session layer integration; minutes for architectural audit |
| Prerequisites | Existing agent system, CI/CD pipeline, ability to run evals, test infrastructure | Existing AI chat or streaming product, understanding of current transport layer |
| Output type | Verified PRs with evidence artifacts, memory files, eval scores | Resilient streaming architecture, multi-surface session layer, bidirectional transport |
| Creator background | Nick Nisi: developer tooling, TypeScript, agentic engineering pipelines | Mike Christensen (Ably): real-time infrastructure, streaming architecture, AI UX |
| Key enforcement mechanism | State machine gates with cryptographic evidence | Durable pub/sub sessions decoupling agent from client |
| Multi-agent support | Five-agent pipeline (Implementer, Verifier, Reviewer, Closer, Retrospective) | Any number of agents writing to shared session; eliminates orchestrator relay bottleneck |

What does Harness Engineering for AI Agents do?

Nick Nisi's Harness Engineering framework addresses a specific, painful problem: AI agents that claim to have completed work but actually skipped steps, hallucinated results, or produced unverifiable output. The framework replaces trust in agent self-reports with structural enforcement.

At its core, the framework wraps agent execution in a TypeScript state machine that orchestrates five specialized agents — Implementer, Verifier, Reviewer, Closer, and Retrospective — with hard gates between each stage. These gates are not prompt instructions the agent can ignore; they are code-level checkpoints that structurally block pipeline advancement until cryptographic or mechanical evidence proves the required work was done.
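
To make the idea of a code-level gate concrete, here is a minimal sketch of a gated pipeline. It is not Nick Nisi's implementation: the names `GatedPipeline`, `Evidence`, `Gate`, and the stage identifiers are hypothetical, chosen only to mirror the five roles named above. The point it illustrates is that advancement is decided by code that inspects evidence, not by the agent saying it is finished.

```typescript
// Hypothetical stage names mirroring the five agents described in the article.
type Stage = "implement" | "verify" | "review" | "close" | "retrospect";

// Illustrative evidence record; a real harness would carry richer artifacts.
interface Evidence {
  stage: Stage;
  artifact: string; // e.g. a hash of test output or a path to a recording
}

// A gate returns true only when the evidence proves the stage's work was done.
type Gate = (evidence: Evidence | undefined) => boolean;

const STAGE_ORDER: Stage[] = ["implement", "verify", "review", "close", "retrospect"];

class GatedPipeline {
  private index = 0;
  private readonly evidence: Partial<Record<Stage, Evidence>> = {};
  private readonly gates: Partial<Record<Stage, Gate>>;

  constructor(gates: Partial<Record<Stage, Gate>>) {
    this.gates = gates;
  }

  // The stage currently being worked on, or "done" once all stages have passed.
  current(): Stage | "done" {
    return STAGE_ORDER[this.index] ?? "done";
  }

  // Agents hand in evidence; they never move the pipeline themselves.
  submit(evidence: Evidence): void {
    this.evidence[evidence.stage] = evidence;
  }

  // Throws instead of advancing when the current stage's gate is not satisfied.
  advance(): Stage | "done" {
    const stage = this.current();
    if (stage === "done") return stage;
    const gate = this.gates[stage];
    if (gate !== undefined && !gate(this.evidence[stage])) {
      throw new Error(`Gate blocked at "${stage}": evidence missing or invalid`);
    }
    this.index += 1;
    return this.current();
  }
}
```

In this sketch a stage without a registered gate advances freely; a stricter harness would probably require a gate for every stage.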

Key innovations include SHA-256 hashed test output as completion proof, Playwright before/after videos for UI fixes, failure-driven memory files that accumulate lessons from every run, and mandatory retrospective analysis that treats every failure as a harness bug rather than an agent bug. The framework is grounded in measurement: evals run before and after every change, and trust is defined as a pass rate — never a subjective feeling.
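
One plausible reading of "SHA-256 hashed test output as completion proof" is sketched below, assuming a Node.js runtime. The harness, not the agent, runs the tests, stores the raw output, and records its digest. A later gate recomputes the digest before accepting the claim. The helper names (`recordEvidence`, `verifyEvidence`, `passRate`) and the `TestEvidence` shape are assumptions for illustration and are not taken from the framework itself.

```typescript
import { createHash } from "node:crypto";

// Hypothetical evidence record written by the harness after it runs the tests.
interface TestEvidence {
  command: string;
  exitCode: number;
  outputSha256: string;
}

// Hex-encoded SHA-256 digest of the captured test output.
function hashOutput(output: string): string {
  return createHash("sha256").update(output, "utf8").digest("hex");
}

function recordEvidence(command: string, exitCode: number, output: string): TestEvidence {
  return { command, exitCode, outputSha256: hashOutput(output) };
}

// The gate accepts only a zero exit code whose stored output still matches the digest.
function verifyEvidence(evidence: TestEvidence, storedOutput: string): boolean {
  return evidence.exitCode === 0 && hashOutput(storedOutput) === evidence.outputSha256;
}

// Trust expressed as a number: the fraction of eval runs that passed.
function passRate(results: boolean[]): number {
  if (results.length === 0) return 0;
  return results.filter((passed) => passed).length / results.length;
}
```

A digest on its own only shows that the stored output has not changed since it was recorded. It proves completion only if the harness produced that output by actually running the command.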

What does the Durable Sessions AI UX Framework do?

Mike Christensen's Durable Sessions framework tackles a completely different layer of the AI product stack: the delivery and connectivity infrastructure between agents and users. It diagnoses why AI chat experiences break under real-world conditions and provides an architectural pattern to fix them.

The framework identifies three foundational capabilities that separate a fragile demo from a production-quality AI product: Resilient Delivery (streams survive disconnections), Continuity Across Surfaces (sessions follow users across tabs and devices), and Live Control (users can steer, interrupt, or cancel agents mid-generation).

The core architectural move is introducing a Durable Session — a persistent, stateful, shared resource sitting between the agent layer and client layer. Agents publish events to the session; clients subscribe to it. Neither holds a direct connection to the other. This eliminates the Single-Connection Trap where stream health is coupled to one client's connection, resolves the SSE Resume-Cancel Conflict where closing a connection is ambiguous between disconnect and cancel, and solves the Orchestrator Dual-Purpose Problem where orchestrators are forced to relay sub-agent updates.
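
The decoupling can be sketched with an in-memory stand-in for a session. This is an illustration only: a production Durable Session would live in a hosted pub/sub service with persistence, and the `DurableSession` class, its method names, and the offset-based resume scheme are assumptions made for this example. What it shows is that agents append to a log, clients read from it, and a client that reconnects replays from the last offset it saw.

```typescript
// Illustrative event shape; offsets let a client resume where it left off.
interface SessionEvent {
  offset: number;
  agentId: string;
  payload: string;
}

type Listener = (event: SessionEvent) => void;

class DurableSession {
  private readonly log: SessionEvent[] = [];
  private listeners: Listener[] = [];

  // Agents publish to the session and never hold a client connection.
  publish(agentId: string, payload: string): SessionEvent {
    const event: SessionEvent = { offset: this.log.length, agentId, payload };
    this.log.push(event);
    this.listeners.forEach((listener) => listener(event));
    return event;
  }

  // Replays every event after `afterOffset`, then delivers live events.
  // Returns an unsubscribe function, which models a client disconnecting.
  subscribe(listener: Listener, afterOffset: number = -1): () => void {
    for (const event of this.log.slice(afterOffset + 1)) {
      listener(event);
    }
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }
}
```

Because the log belongs to the session, a second tab or device can subscribe from offset -1 and see the whole in-progress response, and a dropped client has no effect on the agent's ability to keep publishing.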

How do they compare?

These two frameworks operate on entirely different layers of the AI product stack and solve fundamentally different problems. They do not compete — they are complementary.

Harness Engineering operates at the agent execution layer. It ensures the agent actually does correct work and can prove it. It is concerned with whether the code the agent wrote is right, whether the tests actually passed, and whether the agent learned from its mistakes. It is a backend engineering discipline.

Durable Sessions operates at the delivery and transport layer. It ensures that whatever the agent produces actually reaches the user reliably, across devices, with the ability to resume after disconnection and control the agent in real-time. It is an infrastructure and UX architecture discipline.

A useful analogy: Harness Engineering ensures the factory produces correct parts. Durable Sessions ensures those parts are reliably shipped to the customer regardless of logistics disruptions. You need both for a production system, but they solve independent problems.

Harness Engineering is clearly more complex to implement — it requires a full state machine, five coordinated agents, an eval suite, evidence artifact infrastructure, and retrospective memory management. Durable Sessions is a more contained architectural change, primarily involving introducing a pub/sub session layer and potentially swapping SSE for WebSockets.

Harness Engineering is the better choice if your agents produce unreliable or unverifiable work. Durable Sessions is the better choice if your agent output is correct but your users experience dropped streams, can't switch devices, or can't control the agent mid-response.

Which should you choose?

If your problem is agent reliability, choose Harness Engineering. Signs you need it: agents claim tasks are done but they aren't, test results are fabricated or skipped, code reviews reveal the agent didn't follow instructions, and you're manually patching agent output on every run. Harness Engineering will structurally prevent these failures through gates, evidence, and self-improving memory.

If your problem is delivery and UX fragility, choose Durable Sessions. Signs you need it: users lose responses when switching networks, there's no way to see an in-progress response on a second device, your stop button is unreliable or ambiguous, and your orchestrator code is bloated with relay logic for sub-agent updates.

If you're building a production AI agent product from scratch, plan for both. Implement Harness Engineering to guarantee your agents produce correct, verified work. Implement Durable Sessions to guarantee that work reaches your users reliably across all conditions. Start with whichever addresses your most acute pain point today.

Neither framework addresses model selection, prompt engineering fundamentals, or training data — both assume you already have a capable model and focus on the engineering infrastructure around it. This shared philosophy — that the gap between demo and production is infrastructure, not model quality — makes them natural partners in a mature AI engineering stack.

// FREQUENTLY ASKED QUESTIONS

Can I use Harness Engineering and Durable Sessions together?

Yes, and you should for production AI agent products. They operate on completely different layers — Harness Engineering ensures agents produce correct, verified work at the execution layer, while Durable Sessions ensures that work reliably reaches users at the delivery layer. They are complementary, not competing frameworks.

Do I need Harness Engineering if my AI agent already works most of the time?

Yes, if you need provable reliability. 'Most of the time' means you're trusting the agent's self-report. Harness Engineering replaces that trust with cryptographic evidence and mechanical verification. If your agent's pass rate matters — for production code, customer-facing output, or regulated environments — you need structural enforcement, not hope.

What is a Durable Session and how is it different from a WebSocket?

A Durable Session is a persistent, shared, resumable resource that sits between agents and clients. WebSockets are a transport protocol. A WebSocket alone doesn't solve multi-device visibility or automatic resume after disconnect. A Durable Session uses pub/sub channels (often over WebSockets) to decouple agents from clients entirely, enabling resilience, cross-surface continuity, and live control simultaneously.

Is Harness Engineering only for coding agents?

No, but it is most immediately applicable to engineering agents that produce verifiable artifacts like code, test results, and UI changes. The core principles — enforce don't instruct, replace trust with evidence, retrospective memory — apply to any agent performing multi-step tasks where completion can be mechanically verified. The evidence artifact format changes by domain.

Why can't I just use SSE for my AI streaming product?

SSE is one-directional (server to client), which creates the Resume-Cancel Conflict: when an SSE connection closes, the server cannot tell a network disconnect from a user-initiated cancel. If closing the connection is the only signal the client can send, you cannot support both resume and cancel. If you need a stop button, steering messages, or follow-up prompts mid-generation, you need bidirectional transport like WebSockets plus a Durable Sessions layer.
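
The ambiguity disappears once cancel is an explicit message and not a side effect of the connection closing. The sketch below is transport-agnostic and hypothetical: `Generation`, `ClientMessage`, and the handler names are invented for illustration, and wiring them to a real WebSocket or pub/sub channel is left out.

```typescript
// Messages a client can send upstream over a bidirectional channel (assumed shapes).
type ClientMessage =
  | { type: "cancel" }
  | { type: "steer"; text: string };

class Generation {
  private cancelled = false;
  private connectedClients = 1;
  private readonly steering: string[] = [];

  // A dropped connection is only a transport event: keep generating so the client can resume.
  onConnectionClosed(): void {
    this.connectedClients = Math.max(0, this.connectedClients - 1);
  }

  onConnectionOpened(): void {
    this.connectedClients += 1;
  }

  // Only an explicit message changes what the agent does.
  onClientMessage(message: ClientMessage): void {
    if (message.type === "cancel") {
      this.cancelled = true;
    } else {
      this.steering.push(message.text);
    }
  }

  // The agent loop checks this between tokens or steps.
  shouldContinue(): boolean {
    return !this.cancelled;
  }

  pendingSteering(): string[] {
    return this.steering.slice();
  }

  clientCount(): number {
    return this.connectedClients;
  }
}
```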

How long does it take to implement Harness Engineering?

Full implementation with state machine, five agents, eval suite, and retrospective memory takes days to weeks depending on your existing infrastructure. However, you can start with high-impact pieces immediately — adding targeted gotcha files takes hours, and implementing a single evidence gate for test verification can be done in a day.

What is the Orchestrator Dual-Purpose Problem in multi-agent AI systems?

When an orchestrator agent must both coordinate sub-agent tasks and proxy their granular progress updates to clients, it becomes a bottleneck with unnecessarily complex relay logic. Durable Sessions solve this by letting each sub-agent publish updates directly to a shared session channel, freeing the orchestrator to focus solely on delegation and result aggregation.
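
A small sketch of that division of labour follows, with hypothetical names throughout (`SubAgent`, `makeSubAgent`, `orchestrate`, and the `Publish` callback standing in for a shared session channel). Sub-agents report progress through the channel themselves, so the orchestrator contains no relay code: it only assigns tasks and collects results.

```typescript
// Stand-in for publishing to a shared session channel (assumption for this sketch).
type Publish = (agentId: string, update: string) => void;

interface SubAgent {
  id: string;
  run(task: string, publish: Publish): string;
}

// A toy sub-agent that reports its own progress and returns a result.
function makeSubAgent(id: string): SubAgent {
  return {
    id,
    run(task: string, publish: Publish): string {
      publish(id, `started ${task}`);
      publish(id, `finished ${task}`);
      return `${id}:${task}`;
    },
  };
}

// The orchestrator delegates round-robin and aggregates results.
// It passes the channel through but never forwards progress updates itself.
function orchestrate(tasks: string[], agents: SubAgent[], publish: Publish): string[] {
  if (agents.length === 0) throw new Error("at least one sub-agent is required");
  return tasks.map((task, i) => {
    const agent = agents[i % agents.length];
    if (agent === undefined) throw new Error("unreachable: agent index out of range");
    return agent.run(task, publish);
  });
}
```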

Which framework should I learn first as an AI engineer?

Start with Harness Engineering if you're building agents that perform autonomous tasks, because agent correctness is a prerequisite to everything else. Start with Durable Sessions if you already have working agents and are building a user-facing AI chat product that needs to handle real-world network conditions and multi-device usage gracefully.