Anthropic Planner-Generator-Evaluator Long-Agent Framework

Last updated: 21 May 2026

Design and operate AI agent harnesses that run coherently for hours without losing context, quality, or direction — producing production-grade outputs a solo agent loop cannot achieve.

// TL;DR

The Anthropic Planner-Generator-Evaluator Long-Agent Framework is a three-role harness architecture for running AI agents coherently over multi-hour sessions. A Planner decomposes a vague user prompt into high-level sprints, a Generator builds features one at a time, and an adversarial Evaluator grades output using live verification tools against a negotiated contract — not the Generator's self-assessment. Use it whenever you're building agent systems expected to run longer than 30 minutes, producing complex multi-feature applications autonomously, or diagnosing why a long-running agent is drifting, stalling, or rubber-stamping its own poor output.

Framework

// When should I use the Planner-Generator-Evaluator framework for long-running agents?

Use this skill whenever you are architecting an agent system expected to run longer than ~30 minutes, building complex multi-feature applications autonomously, or diagnosing why a long-running agent is drifting, stalling, or rubber-stamping its own poor output.

// What inputs do I need to set up a Planner-Generator-Evaluator agent harness?

task_prompt (required)
The high-level, intentionally vague user request (e.g. 'build a retro game maker'). One line is fine; the Planner handles decomposition.
quality_rubric (required)
Your written, opinionated criteria for what 'good' looks like across 2-4 dimensions (e.g. Design, Originality, Craft, Functionality). Must be specific enough to produce granular, actionable critique.
model_selection (required)
Which model(s) to use for each role. Drives harness design choices, because capabilities and failure modes differ per model generation.
target_domain (required)
The type of artifact being built (web app, CLI tool, data pipeline, etc.). Determines which verification tools to deploy (e.g. Playwright MCP for web apps, computer use for native apps).
reference_examples (optional)
Few-shot examples of good and bad output that calibrate the Evaluator's taste toward yours (e.g. 'this is good design, this is AI slop').
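
As a rough illustration, the inputs above could be gathered into one configuration object and checked before a run starts. This is a minimal sketch, not part of the framework itself: the class name `HarnessConfig`, the `problems` method, the role keys and the placeholder model names are all assumptions made for the example. Only the five field names come from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class HarnessConfig:
    """Hypothetical container for the harness inputs (field names mirror the list above)."""

    task_prompt: str                 # vague one-line user request
    quality_rubric: dict[str, str]   # dimension name -> written criteria
    model_selection: dict[str, str]  # role name -> model identifier
    target_domain: str               # e.g. "web app", "CLI tool"
    reference_examples: list[dict] = field(default_factory=list)  # optional few-shot examples

    def problems(self) -> list[str]:
        """Return a list of setup problems; an empty list means the config is usable."""
        found = []
        if not self.task_prompt.strip():
            found.append("task_prompt is empty")
        if not 2 <= len(self.quality_rubric) <= 4:
            found.append("quality_rubric should cover 2-4 dimensions")
        for role in ("planner", "generator", "evaluator"):  # assumed role keys
            if role not in self.model_selection:
                found.append(f"model_selection is missing role '{role}'")
        if not self.target_domain.strip():
            found.append("target_domain is empty")
        return found


config = HarnessConfig(
    task_prompt="build a retro game maker",
    quality_rubric={
        "Design": "Distinct visual identity; no generic purple gradients.",
        "Functionality": "Every control on screen does what its label says.",
    },
    # Placeholder identifiers, not real model names.
    model_selection={"planner": "model-a", "generator": "model-b", "evaluator": "model-b"},
    target_domain="web app",
)
print(config.problems())
```

Checking the inputs up front is cheap compared with discovering a missing rubric dimension several hours into a run.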

// What are the core principles behind the Planner-Generator-Evaluator framework?

Self-Evaluation Is a Trap

Models cannot reliably judge their own output. The same sycophancy and generosity bias that appears in general LLM-as-judge systems applies equally to coding agents. A builder that reviews its own work will see a half-baked button and call it done. Always use an adversarial evaluator in a separate context window.

The Generator-Evaluator Gap

The key insight stolen from GANs: tuning a standalone critic to be harsh is tractable; tuning a builder to be self-critical is not. It is far easier to critique a meal than to cook one. Exploit this asymmetry by giving each role its own context window, system prompt, and job.

Contracts Over Specs

Before the Generator writes a single line, the Generator and Evaluator negotiate what 'done' means via files on disk — one writes the markdown, the other reads and pushes back. The Evaluator grades against this negotiated contract, not the original spec. This converts fuzzy user stories into granular, testable assertions without the Planner over-specifying upfront.
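
One way to picture the file-based negotiation is sketched below. It is an illustration under stated assumptions: the text describes a markdown file, while this sketch uses JSON only to keep parsing short. The function names `propose` and `review`, the `contract.json` filename, the field names and the two hard-coded checks are hypothetical stand-ins for what would really be model calls.

```python
import json
import tempfile
from pathlib import Path


def propose(contract_path: Path, criteria: list[dict]) -> None:
    """Generator side: write the proposed criteria to the shared file."""
    contract = {"status": "proposed", "criteria": criteria, "objections": []}
    contract_path.write_text(json.dumps(contract, indent=2))


def review(contract_path: Path, min_criteria: int) -> dict:
    """Evaluator side: read the proposal and push back where it is weak.

    The two checks here are stand-ins for a model call using the Evaluator prompt.
    """
    contract = json.loads(contract_path.read_text())
    criteria = contract["criteria"]
    objections = [
        f"{c['id']}: no verification step" for c in criteria if not c.get("verify")
    ]
    if len(criteria) < min_criteria:
        objections.append(f"{len(criteria)} criteria proposed, want at least {min_criteria}")
    contract["objections"] = objections
    contract["status"] = "disputed" if objections else "agreed"
    contract_path.write_text(json.dumps(contract, indent=2))
    return contract


contract_file = Path(tempfile.mkdtemp()) / "contract.json"
assertion = "Drag-and-drop reordering persists on reload"

# Round 1: the proposal has no verification step, so the Evaluator objects.
propose(contract_file, [{"id": "C1", "assertion": assertion, "verify": ""}])
round_one = review(contract_file, min_criteria=1)

# Round 2: the Generator revises the file and the Evaluator accepts.
propose(contract_file, [{"id": "C1", "assertion": assertion,
                         "verify": "Drag three tasks, refresh, compare order"}])
round_two = review(contract_file, min_criteria=1)
print(round_one["status"], round_two["status"])
```

In a real harness `min_criteria` would sit nearer the 20-30 range mentioned later; it is set to 1 here so the demo stays small.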

Context Rot and Context Anxiety

As a session deepens, coherence degrades (context rot). Near the context limit, models rush to finish prematurely (context anxiety). Design the harness to manage context deliberately — via compaction, fresh sessions per feature, or structured hand-offs — depending on which model generation you are running.

Structured Hand-offs and Clean Contexts

Lossy summaries drift. Instead of relying on compaction to preserve meaning across very long runs, use the file system as shared state. Persistent artifacts — progress files, feature lists as JSON (not markdown, which models overwrite), timestamped learnings logs — are more reliable than context-window memory.
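
A minimal sketch of file-system state, assuming a simple layout: `featurelist.json` holds a list of features with a boolean `passes` flag, and learnings are appended as timestamped JSON lines. The `passes` field, the `learnings.jsonl` filename and the helper names are assumptions for illustration; the text only specifies that these artifacts are JSON files on disk.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_json_atomic(path: Path, data) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(data, indent=2))
    os.replace(tmp, path)


def mark_feature(path: Path, feature_id: str, passes: bool) -> None:
    """Flip one feature's status and leave every other entry untouched."""
    features = json.loads(path.read_text())
    for feature in features:
        if feature["id"] == feature_id:
            feature["passes"] = passes
    write_json_atomic(path, features)


def log_learning(path: Path, note: str) -> None:
    """Append one timestamped entry; earlier entries are never rewritten."""
    entry = {"ts": datetime.now(timezone.utc).isoformat(), "note": note}
    with path.open("a") as handle:
        handle.write(json.dumps(entry) + "\n")


workdir = Path(tempfile.mkdtemp())
feature_file = workdir / "featurelist.json"
log_file = workdir / "learnings.jsonl"

write_json_atomic(feature_file, [
    {"id": "project-creation", "passes": False},
    {"id": "notifications", "passes": False},
])
mark_feature(feature_file, "project-creation", True)
log_learning(log_file, "init script must run before the dev server starts")

# A fresh session can orient itself from disk alone.
remaining = [f["id"] for f in json.loads(feature_file.read_text()) if not f["passes"]]
print(remaining)
```

Because each helper changes one field or appends one line, an agent that calls them has less opportunity to overwrite the whole file than one that regenerates it from memory.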

Harness and Model Co-Evolve

The harness does not disappear as models improve — it evolves. Identify the spiky failure modes of each model generation and fill those gaps with scaffolding. As the model improves and internalises a harness behaviour, simplify or remove that scaffold component. The frontier moves; your harness should track it.

Taste Is Gradable

Subjective quality — design taste, originality, aesthetic — is gradable if you have a strong enough opinion and write it down. Calibrate the Evaluator with few-shot reference examples of good and bad output. Vague criteria produce vague critiques; granular criteria produce actionable fixes.
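
The calibration step can be sketched as assembling the written rubric and the labelled reference examples into the Evaluator's system prompt. The function name `build_evaluator_prompt`, the prompt wording and the 1-10 scale are illustrative assumptions, not wording from the framework.

```python
def build_evaluator_prompt(rubric: dict[str, str], examples: list[dict]) -> str:
    """Combine written criteria and labelled examples into one system prompt."""
    lines = [
        "You are an adversarial QA evaluator. Grade only the artifact you are shown.",
        "",
        "Rubric:",
    ]
    for dimension, criteria in rubric.items():
        lines.append(f"- {dimension}: {criteria}")
    lines += ["", "Reference examples:"]
    for example in examples:
        lines.append(f"- [{example['label']}] {example['description']}")
    # The 1-10 scale is an assumption made for this sketch.
    lines += ["", "Score each dimension from 1 to 10 and name one concrete fix per dimension."]
    return "\n".join(lines)


prompt = build_evaluator_prompt(
    rubric={
        "Design": "Clear hierarchy and a deliberate palette.",
        "Originality": "Layout choices that are not the default template.",
    },
    examples=[
        {"label": "good design", "description": "Dense dashboard with a restrained two-colour palette."},
        {"label": "AI slop", "description": "Centered hero, purple gradient, three generic cards."},
    ],
)
print(prompt)
```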

High-Level Planning Only

The Planner should output a deliberately high-level spec broken into sprints — never granular technical decisions. Over-specification at the planning stage causes errors to cascade and magnify over a multi-hour time horizon. Let the Generator and Evaluator resolve the technical details through their contract.

Read the Traces

The primary debugging loop for any long-running harness is reading agent transcripts by hand, line by line, not running more experiments. Only by empathising with the model — understanding why it made each decision — can you know which scaffold components to delete, adjust, or keep as the frontier moves.
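
Reading by hand still needs a starting point. As a small sketch, assuming you have recorded your own pass/fail verdict per contract criterion alongside the Evaluator's, the disagreements give you the list of transcript sections to read first. The function name and the verdict format are hypothetical.

```python
def find_divergences(evaluator_verdicts: dict[str, bool],
                     human_verdicts: dict[str, bool]) -> list[str]:
    """Return the criterion ids where the Evaluator and the human reviewer disagree."""
    return sorted(
        criterion_id
        for criterion_id, human in human_verdicts.items()
        if criterion_id in evaluator_verdicts and evaluator_verdicts[criterion_id] != human
    )


evaluator = {"C1": True, "C2": True, "C3": False}
human = {"C1": True, "C2": False, "C3": False}

to_read = find_divergences(evaluator, human)
print(to_read)
```

The helper only narrows where to look; understanding why the Evaluator passed `C2` still means reading that part of the transcript.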

// How do you apply the Planner-Generator-Evaluator framework step by step?

1. Initialise the Planner agent
Feed the one-line user prompt to a Planner agent (use your most capable planning model, e.g. Opus-class). Its output must be: (a) a sprint-level feature list saved as featurelist.json — JSON, not markdown, because models are less likely to overwrite JSON files; (b) a progress file tracking feature completion state; (c) a Git repo initialised with an init script so subsequent sessions do not re-derive setup steps. Keep the spec high-level — no granular technical decisions. The Planner's job is to set hard outer lines for the product, not micromanage implementation.
2. Assign roles and context windows
Instantiate three separate agents, each with its own context window, system prompt, and single responsibility: Planner (done after Step 1), Generator (builder/IC), Evaluator (critic/QA). Never share raw Generator context with the Evaluator — this muddies the adversarial pressure. The Evaluator should only see the output artifact, not the Generator's reasoning trace.
3. Run the Generator-Evaluator contract negotiation
Before any code is written, the Generator proposes what it will build and how it should be verified. The Evaluator responds via a shared file on disk — pushing back on scope, weak tests, or missed edge cases. They iterate via file read/write until both agree. The resulting contract is the ground truth for this sprint — not the original Planner spec. Target ~20-30 granular contract criteria for meaningful, actionable grading. Vague criteria produce vague critiques.
4. Execute the Generator build loop
The Generator picks one feature (only one) that has not yet passed all tests, orients itself using the progress file and init script, builds the feature, and runs verification. Use programmatic tool calling where possible to reduce context consumption. The Generator should write timestamped learnings and state to a JSON log file throughout — these are breadcrumbs for future sessions or human handoff.
5. Deploy the Evaluator with live verification tools
The Evaluator must actively use the artifact — launch Playwright MCP (for web apps) or computer use (for native apps) to open live pages, click around, and stress-test features. It grades against the negotiated contract across your rubric dimensions (e.g. Design, Originality, Craft, Functionality — weight toward the dimensions where the model is weakest, not functionality if the model already handles that well). Calibrate harshness with few-shot reference examples of good and bad output before deploying.
6. Handle the Evaluator feedback loop
The Evaluator writes its critique and score back to a shared file. If the Generator cannot hill-climb against the rubric after repeated attempts on a given criterion, the harness should discard the current attempt entirely and restart rather than keep patching the same broken approach. This ability to course-correct over long time horizons is the core advantage over a Ralph loop or single-session self-review.
7. Update the progress file and loop
If the feature passes all contract criteria, the Generator writes the Git commit and marks the feature as complete in featurelist.json. If unfinished features remain, the loop continues — either in the same session (with compaction, for capable models) or in a fresh context window (for models with severe context rot or anxiety). Choose based on the spiky behaviours of your current model generation.
8. Read traces and tune the harness prompts
After each run, read the full agent transcripts by hand. Find every point where the Evaluator's judgment diverged from yours. Treat this like reading a stack trace — empathise with why the model made each decision. Update the Evaluator's system prompt and rubric to close that gap. Optionally, pipe transcripts to a secondary agent to grep for patterns and suggest prompt updates. Only delete harness components when model improvements have rendered them redundant — run a simplified version and evaluate before committing.
9. Adapt the harness to the current model generation
Regularly reassess which scaffold components are still load-bearing. Examples: context-window resets between sessions may be critical for one model generation and unnecessary the next; sprint decomposition may be essential for weaker planners but removable for stronger agentic models; Evaluator cadence may shift from every sprint to end-of-full-generation. A harness is only ever right for a specific model generation, so track model releases and strip components accordingly.
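
The build, evaluate and restart steps above can be sketched as a control loop. This is a simplified illustration under stated assumptions: `build` and `evaluate` are stub callables standing in for the Generator and Evaluator agents, the `max_attempts` and `max_restarts` limits are hypothetical knobs, and Git commits and context management are left out.

```python
def run_harness(features, build, evaluate, max_attempts=3, max_restarts=2):
    """Build one feature at a time; discard the approach and restart when stuck."""
    log = []
    for feature in features:
        if feature["passes"]:
            continue  # already done in an earlier session
        passed = False
        for restart in range(max_restarts + 1):
            feedback = None  # a restart throws away earlier work and critique
            for attempt in range(max_attempts):
                artifact = build(feature, feedback)
                passed, feedback = evaluate(feature, artifact)
                log.append((feature["id"], restart, attempt, passed))
                if passed:
                    break
            if passed:
                feature["passes"] = True
                break
    return log


# Stub agents: each feature passes once it has been built a set number of times.
builds_needed = {"tasks": 1, "realtime": 4}
build_counts = {"tasks": 0, "realtime": 0}


def build(feature, feedback):
    build_counts[feature["id"]] += 1
    return {"feature": feature["id"], "build_number": build_counts[feature["id"]]}


def evaluate(feature, artifact):
    ok = artifact["build_number"] >= builds_needed[feature["id"]]
    return ok, None if ok else "criterion failed"


features = [{"id": "tasks", "passes": False}, {"id": "realtime", "passes": False}]
run_log = run_harness(features, build, evaluate)
for row in run_log:
    print(row)
```

In this demo the `realtime` feature fails three attempts, is restarted once, and passes on the first attempt of the restart. A real harness would decide when to restart from the Evaluator's scores rather than from a fixed attempt count.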

// What are real examples of the Planner-Generator-Evaluator framework in action?

A user provides the prompt 'build a full-stack project management tool' with no further specification.

The Planner receives the vague prompt and outputs a sprint-level featurelist.json covering major workflow areas (project creation, task assignment, progress tracking, notifications) without specifying technical stack details. The Generator and Evaluator negotiate a contract of ~25 criteria — e.g. 'drag-and-drop reordering must persist on reload; verify by dragging three tasks and refreshing'. The Generator builds one feature per loop iteration. The Evaluator launches Playwright, drags tasks, refreshes, and grades against the contract. When the Generator repeatedly fails to pass the 'real-time collaboration' criterion, the harness discards that attempt and restarts that sprint from scratch rather than patching. After 4-6 hours the progress file shows all features marked complete with passing tests.

A team wants to improve the visual design quality of AI-generated front-end UIs, which keep producing 'AI slop' aesthetics (purple gradients, generic layouts).

A rubric with four criteria is created: Design, Originality, Craft, Functionality — weighted heavily toward Design and Originality since the model already handles Functionality. The Evaluator is calibrated with few-shot reference screenshots labelled 'good design' and 'AI slop'. The Generator produces an HTML/CSS page; the Evaluator takes Playwright screenshots and scores across the rubric. If Originality scores consistently low across multiple rounds, the harness pivots — discards the current design direction entirely and restarts with a different generative seed — rather than iterating on the same failing aesthetic. After 5-15 rounds the output converges toward the rubric's defined taste.
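
The weighting and the pivot rule in this example can be sketched numerically. The weights, the scores, the threshold of 5 and the three-round window below are invented for illustration; the text only says to weight toward Design and Originality and to pivot when Originality stays low across multiple rounds.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * weight for dim, weight in weights.items()) / total_weight


def should_pivot(history: list[dict], dimension: str, threshold: float, rounds: int) -> bool:
    """True when the last `rounds` evaluations all scored below `threshold` on `dimension`."""
    recent = history[-rounds:]
    return len(recent) == rounds and all(r[dimension] < threshold for r in recent)


# Assumed weights: heavy on Design and Originality, light on what the model already does well.
weights = {"Design": 4, "Originality": 4, "Craft": 1, "Functionality": 1}

history = [
    {"Design": 5, "Originality": 3, "Craft": 6, "Functionality": 9},
    {"Design": 6, "Originality": 4, "Craft": 7, "Functionality": 9},
    {"Design": 6, "Originality": 3, "Craft": 7, "Functionality": 9},
]

latest = weighted_score(history[-1], weights)
pivot = should_pivot(history, "Originality", threshold=5, rounds=3)
print(latest, pivot)
```

With these numbers the high Functionality score barely moves the total, and three consecutive low Originality scores trigger the pivot instead of another round on the same design direction.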

An agent harness was built and tuned for a previous model generation and the team upgrades to a newer, more agentic model.

The team identifies the spiky behaviours of the new model: context anxiety is gone (no need for forced session resets), the model can hold coherence across a 2-hour continuous build (sprint-by-sprint decomposition less critical), and it is willing to discard its own work and restart when the rubric is not met. The team runs a simplified harness — single continuous session with compaction, Evaluator running at end-of-full-generation rather than per sprint — and compares output quality and cost. Scaffold components rendered redundant by the model upgrade are removed. The harness remains structurally the same (Planner-Generator-Evaluator) but with fewer moving parts.

// What are common pitfalls when building a Planner-Generator-Evaluator agent harness?

Self-evaluation is a trap: never instruct the Generator to review its own output in the same context window. The sycophancy and generosity bias is just as present in coding tasks as in conversational ones — it will call a half-implemented feature done.
Sharing the Generator's reasoning trace with the Evaluator muddies adversarial pressure. The Evaluator should see only the output artifact, not how it was built; otherwise it can talk itself into believing something works because the Generator's narrative says so.
Vague rubric criteria produce vague critiques. If the Evaluator's grading language is imprecise, the Generator shrugs and makes arbitrary changes. Force yourself to write down granular, opinionated criteria — 20-30 contract items per sprint is a reasonable target.
Over-specifying in the Planner causes cascading errors. If the Planner tries to define granular technical decisions, any mistake at that stage magnifies across every subsequent sprint over a multi-hour horizon. Keep the Planner high-level and let the Generator-Evaluator contract handle specifics.
Using markdown files for persistent state is risky — models tend to overwrite them. Use JSON files for feature lists, progress tracking, and learnings logs.
Context rot and context anxiety are model-specific. Applying a fresh-session-per-feature approach designed for a model with severe context rot to a newer model with strong coherence adds unnecessary complexity and cost. Reassess after every major model release.
Compaction does not equal coherence. Lossy summaries drift over very long runs. Do not assume compaction alone is sufficient for state management — use the file system as the source of truth for shared state.
Running more experiments instead of reading traces is a false shortcut. The primary debugging loop is reading agent transcripts by hand, line by line, to understand where the model's judgment diverged from yours. Only then can prompt tuning be precise.
Treating the harness as permanent is a mistake. Components that are load-bearing for one model generation may be redundant for the next. Actively hunt for scaffold components to delete as model capabilities improve.
Out of the box, LLMs make bad QA agents — they will find a bug and defer it ('fix in 2 weeks') rather than blocking on it. Significant prompt tuning effort is required to make the Evaluator genuinely harsh. Plan for this calibration work upfront.

// What do the key terms in the Planner-Generator-Evaluator framework mean?

Planner: The first agent in the three-role harness. Receives the vague user prompt and produces a high-level, sprint-level spec saved as persistent artifacts (featurelist.json, progress file, init script). Never plans granular technical details — its job is to set the hard outer lines of the product.
Generator: The builder/IC role in the harness. Operates in its own context window, picks one feature at a time from the feature list, implements it, and negotiates the definition-of-done contract with the Evaluator before writing any code.
Evaluator: The adversarial critic/QA role in the harness. Operates in its own context window with a separate, harshly-tuned system prompt. Uses live verification tools (e.g. Playwright MCP) to actively test the artifact — not just read diffs. Grades against the negotiated contract, not the original spec.
Generator-Evaluator Contract: A negotiated, file-based agreement between the Generator and Evaluator that defines exactly what 'done' means for a given sprint — specific features to build and specific tests that must pass. Written and revised via shared files on disk before any code is written. Replaces reliance on the Planner's spec for grading.
Context Rot: The degradation of coherence as an agent works deeper into a context window. Output becomes less consistent and on-track the further the session progresses without intervention.
Context Anxiety: A model behaviour where, as it approaches the end of its context window, it rushes to finish tasks prematurely and incompletely rather than managing the situation gracefully.
Ralph Loop: A technique (originally from Geoffrey Huntley) of feeding a prompt into a coding agent CLI on a loop until all tasks are complete. Described as 'deterministically bad in an undeterministic world': better to fail predictably than succeed unpredictably. Has a fixed plan with no adversarial pressure from a separate evaluator.
Adversarial Pressure: The productive tension between the Generator and Evaluator — analogous to the relationship between generator and discriminator in a GAN. The Evaluator's independent, harsh grading forces the Generator to genuinely improve rather than self-certify.
Context Window Compaction: A mechanism (including server-side compaction) that summarises and compresses prior context to allow a session to continue beyond the raw context limit. Does not equal coherence — summaries are lossy and can drift over very long runs.
Persistent Artifacts: Files written to disk that maintain shared state across agent sessions and context windows: featurelist.json, progress files, init scripts, learnings logs, and timestamped decision records. The file system is the preferred shared state mechanism for long-running agents.
Progressive Disclosure: A context-efficiency technique where only the front matter of a skill or tool description is loaded into the context window initially; the full body is loaded only if that skill is instantiated. Reduces upfront context consumption.
Skills: Packaged tool descriptions, grading rubrics, or behavioural instructions that can be loaded into an agent's context using progressive disclosure. A useful primitive for encoding quality criteria into a reusable, composable form.
Spiky Behaviours: The specific, model-generation-level failure modes or weaknesses that a harness must compensate for. Identifying the current model's spiky behaviours and filling those gaps with scaffolding is the core discipline of harness design.
Hill Climbing: The iterative process by which the Generator improves its output across rubric dimensions in response to Evaluator critique. If the Generator cannot hill climb against a criterion after repeated attempts, the harness should discard and restart rather than continue patching.
AI Slop: The aesthetic failure mode of AI-generated front-end design: generic layouts, purple gradients, and visually undifferentiated output. Used as a calibration anti-example when tuning the Evaluator's design taste.

// FREQUENTLY ASKED QUESTIONS

What is the Planner-Generator-Evaluator framework for AI agents?

It is a three-role harness architecture where a Planner decomposes a vague prompt into sprint-level features, a Generator builds them one at a time, and an adversarial Evaluator grades output with live verification tools against a negotiated contract. Each role runs in its own context window with a separate system prompt, preventing the sycophancy bias that occurs when a single agent reviews its own work. The framework is specifically designed for agent sessions running 30 minutes to several hours.

What is the Generator-Evaluator contract in long-running agents?

The Generator-Evaluator contract is a file-based agreement negotiated before any code is written. The Generator proposes what it will build and how success should be verified; the Evaluator pushes back on scope, weak tests, or missed edge cases via shared files on disk. They iterate until both agree on 20–30 granular, testable criteria. This contract — not the original Planner spec — becomes the ground truth for grading, converting fuzzy user stories into actionable assertions.

How do I set up a Planner-Generator-Evaluator agent system?

Start by feeding a one-line user prompt to a Planner agent that outputs a featurelist.json and progress file. Then instantiate three separate agents — Planner, Generator, Evaluator — each with its own context window and system prompt. Before building, run a Generator-Evaluator contract negotiation via shared files. The Generator builds one feature per loop, and the Evaluator verifies with live tools like Playwright MCP. Grade against the contract, not the original spec. Update the progress file after each passing feature.

How do I write a quality rubric for an AI agent evaluator?

Write opinionated criteria across 2–4 dimensions such as Design, Originality, Craft, and Functionality. Weight dimensions toward the model's weaknesses — not Functionality if the model already handles it well. Aim for 20–30 granular contract items per sprint. Calibrate the Evaluator with few-shot reference examples showing what 'good' and 'bad' look like. Vague criteria produce vague critiques; specific criteria like 'drag-and-drop reordering must persist on page reload' produce actionable fixes.

How does the Planner-Generator-Evaluator framework compare to a Ralph loop?

A Ralph loop feeds a prompt into a coding agent CLI repeatedly until tasks complete; it has a fixed plan with no adversarial pressure. The Planner-Generator-Evaluator framework adds an independent Evaluator in a separate context window that actively tests output with live tools, enabling course correction over long time horizons. The Ralph loop is 'deterministically bad in an undeterministic world,' while the three-role harness can discard failing approaches and restart, rather than patching indefinitely.

When should I use a multi-agent harness instead of a single AI agent?

Use a multi-agent harness whenever your task is expected to run longer than roughly 30 minutes, involves building complex multi-feature applications, or requires subjective quality judgment like design taste. Single agents suffer from self-evaluation bias — they call half-implemented features done. A multi-agent harness with adversarial evaluation maintains quality over hours-long sessions where context rot and context anxiety would degrade a solo agent's output.

What results can I expect from the Planner-Generator-Evaluator framework?

You can expect production-grade, multi-feature applications built autonomously over 4–6 hour sessions with consistent quality across design, functionality, and originality dimensions. The adversarial evaluation loop catches issues a solo agent would self-certify as complete. Visual design converges toward your defined taste within 5–15 evaluation rounds. The framework also produces persistent artifacts — progress files, learnings logs, feature lists — that make handoffs and debugging transparent.

What is context rot in AI agents and how do I prevent it?

Context rot is the degradation of coherence as an agent works deeper into its context window — output becomes less consistent and on-track the further the session goes. Prevent it by managing context deliberately: use fresh context windows per feature, structured hand-offs via files on disk (not lossy summaries), and JSON-based persistent state. The specific strategy depends on your model generation — newer models may handle longer sessions without severe rot, reducing the need for frequent resets.

Why shouldn't I let an AI agent evaluate its own code?

Models cannot reliably judge their own output. The same sycophancy and generosity bias present in LLM-as-judge systems applies to coding agents — a builder reviewing its own work will see a half-baked button and call it done. The key insight from GANs applies: tuning a standalone critic to be harsh is tractable, but tuning a builder to be self-critical is not. Always use an adversarial evaluator in a separate context window that sees only the output artifact, never the Generator's reasoning trace.

How do I debug a long-running AI agent that's drifting?

Read the full agent transcripts by hand, line by line. This, not running more experiments, is the primary debugging loop. Empathise with why the model made each decision at every divergence point. Look for where the Evaluator's judgment diverged from yours, then update its system prompt and rubric to close that gap. Optionally pipe transcripts to a secondary agent to grep for patterns. Only after understanding the traces can prompt tuning be precise enough to fix drift.
