How Does the Planner-Generator-Evaluator Architecture Work?
For AI researchers and prompt engineers studying agent architectures · Based on Anthropic Planner-Generator-Evaluator Long-Agent Framework
// TL;DR
For AI researchers and prompt engineers studying long-running agent architectures, the Planner-Generator-Evaluator framework offers a principled approach to multi-agent harness design grounded in the GAN-inspired insight that tuning a critic is more tractable than tuning self-criticism. The framework introduces key concepts — context rot, context anxiety, structured hand-offs, harness-model co-evolution, and contract negotiation — that formalize the engineering challenges of agents running over multi-hour time horizons and provide a systematic methodology for addressing them.
What theoretical foundations underpin the Planner-Generator-Evaluator architecture?
The architecture is grounded in the asymmetry between generation and evaluation, a principle borrowed from GANs. The core insight: tuning a standalone critic to be harsh is tractable; tuning a builder to be self-critical is not. This maps directly to observed LLM behavior — models exhibit sycophancy and generosity bias when evaluating their own output, even in structured coding tasks.
This asymmetry motivates the architectural decision to separate Generator and Evaluator into independent context windows. The Evaluator never sees the Generator's reasoning trace — only the output artifact. This prevents the Evaluator from rationalizing failures based on the Generator's narrative about its intentions, maintaining genuine adversarial pressure.
The Planner role exists to solve a different problem: error cascading over long time horizons. Over-specification at the planning stage causes mistakes to compound across every subsequent sprint. The Planner therefore operates at the highest abstraction level, outputting sprint-level feature lists without granular technical decisions.
How does the framework formalize context management for long-running agents?
The framework identifies two distinct failure modes in context management:
Context rot — coherence degrades gradually as an agent works deeper into its context window. This is a smooth degradation curve where outputs become less consistent without any sudden failure point.
Context anxiety — as a model approaches its context limit, it exhibits a qualitatively different behavior: rushing to finish tasks prematurely and incompletely. This is a step-function failure, not gradual.
These are model-generation-specific phenomena. The framework prescribes different interventions depending on which behaviors the current model exhibits:
- For severe context rot: fresh context windows per feature, with orientation via persistent artifacts (featurelist.json, progress files, init scripts)
- For context anxiety: proactive session management before the limit is approached
- For models with neither: longer continuous sessions with compaction
Critically, the framework argues that compaction (lossy summarization) does not equal coherence. File-system-based state management is preferred over relying on compacted context to preserve meaning across very long runs.
What is the contract negotiation mechanism and why does it matter?
The Generator-Evaluator contract is perhaps the most novel architectural element. Before any code is written, the Generator proposes what it will build and how it should be verified. The Evaluator responds via shared files on disk, pushing back on scope, identifying weak tests, and surfacing edge cases.
This mechanism solves a fundamental problem in agent evaluation: what constitutes 'done'? The original user prompt is too vague for grading. The Planner's spec is intentionally high-level. The contract creates a negotiated, specific, testable definition of done that both agents agree to — converting fuzzy requirements into granular assertions.
The target of 20–30 contract criteria per sprint is empirically motivated: fewer criteria produce vague critiques the Generator can't act on; more criteria create excessive overhead. Each criterion should be directly testable via the Evaluator's verification tools.
How does harness-model co-evolution work in practice?
The framework explicitly rejects the idea of a permanent harness. Instead, it proposes a co-evolution discipline:
1. Identify spiky behaviors of the current model generation — the specific failure modes that need scaffolding
2. Build scaffold components that compensate for each spike
3. After each model upgrade, run a simplified harness and compare output quality
4. Remove scaffold components that the new model has internalized
5. Add new components for any new spiky behaviors the upgrade introduced
This creates a systematic methodology for harness maintenance that avoids two failure modes: keeping unnecessary scaffolding (adding cost and complexity) and removing load-bearing scaffolding prematurely (causing quality regression).
The primary tool for this co-evolution is trace reading — going through full agent transcripts by hand to understand why the model made each decision. Only by empathizing with the model's reasoning can you determine which scaffold components are still load-bearing.
The next step for researchers is to implement a minimal harness for a controlled task, systematically vary one component at a time (Evaluator cadence, contract granularity, context management strategy), and measure the effect on output quality and token efficiency.
// FREQUENTLY ASKED QUESTIONS
How does the Generator-Evaluator dynamic differ from constitutional AI or RLHF?
Constitutional AI and RLHF operate at training time, shaping the model's weights toward desired behavior. The Generator-Evaluator dynamic operates at inference time, creating adversarial pressure through architectural separation — separate context windows, separate system prompts, and file-based communication. The Evaluator is not training the Generator; it is providing external verification that forces genuine quality improvement within a single deployment session.
What are the open research questions in this framework?
Key open questions include: optimal contract granularity as a function of task complexity, whether the Evaluator's verification can be partially automated through learned test generation, how to formally measure the adversarial pressure gap between shared-context and separate-context evaluation, and whether harness-model co-evolution can be automated rather than requiring manual trace reading. The relationship between context rot curves and model architecture is also under-studied.
Can the three-role architecture be extended to more than three roles?
Yes, though with diminishing returns. The three roles map to the minimal viable separation: strategic direction (Planner), execution (Generator), and verification (Evaluator). Additional roles — such as a Debugger, Architect, or UX Specialist — can be added as specialized Evaluators with domain-specific rubrics and tools. However, each additional role increases coordination overhead. The framework recommends starting with three and adding roles only when trace reading reveals consistent gaps that a specialized role would fill.