Anthropic Planner-Generator-Evaluator Long-Agent Framework

Design and operate AI agent harnesses that run coherently for hours without losing context, quality, or direction — producing production-grade outputs a solo agent loop cannot achieve.

// TL;DR

The Anthropic Planner-Generator-Evaluator Long-Agent Framework is a multi-agent harness architecture that lets AI agents run coherently for hours without losing context, quality, or direction. It separates planning, building, and adversarial evaluation into distinct agents with their own context windows, using file-based contracts and persistent artifacts instead of relying on a single agent's self-review. Use it when architecting agent systems that need to run longer than 30 minutes, building complex multi-feature applications autonomously, or diagnosing why a long-running agent is drifting, stalling, or rubber-stamping its own poor output.

// When should I use the Planner-Generator-Evaluator framework for my AI agents?

Use this skill whenever you are architecting an agent system expected to run longer than ~30 minutes, building complex multi-feature applications autonomously, or diagnosing why a long-running agent is drifting, stalling, or rubber-stamping its own poor output.

// What inputs do I need to set up a Planner-Generator-Evaluator agent harness?

  • task_prompt (required)
    The high-level, intentionally vague user request (e.g. 'build a retro game maker'). One line is fine — the Planner handles decomposition.
  • quality_rubric (required)
    Your written, opinionated criteria for what 'good' looks like across 2-4 dimensions (e.g. Design, Originality, Craft, Functionality). Must be specific enough to produce granular, actionable critique.
  • model_selection (required)
    Which model(s) to use for each role. Drives harness design choices — capabilities and failure modes differ per model generation.
  • target_domain (required)
    The type of artifact being built (web app, CLI tool, data pipeline, etc.) — determines which verification tools to deploy (e.g. Playwright MCP for web apps, computer use for native apps).
  • reference_examples (optional)
    Few-shot examples of good and bad output that calibrate the Evaluator's taste toward yours (e.g. 'this is good design, this is AI slop').

// What are the core principles behind the Planner-Generator-Evaluator long-agent framework?

Self-Evaluation Is a Trap

Models cannot reliably judge their own output. The same sycophancy and generosity bias that appears in general LLM-as-judge systems applies equally to coding agents. A builder that reviews its own work will see a half-baked button and call it done. Always use an adversarial evaluator in a separate context window.

The Generator-Evaluator Gap

The key insight stolen from GANs: tuning a standalone critic to be harsh is tractable; tuning a builder to be self-critical is not. It is far easier to critique a meal than to cook one. Exploit this asymmetry by giving each role its own context window, system prompt, and job.

Contracts Over Specs

Before the Generator writes a single line, the Generator and Evaluator negotiate what 'done' means via files on disk — one writes the markdown, the other reads and pushes back. The Evaluator grades against this negotiated contract, not the original spec. This converts fuzzy user stories into granular, testable assertions without the Planner over-specifying upfront.
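
The negotiation described above can be sketched as a loop over a shared file. This is a minimal illustration, not the framework's actual protocol: the file name `contract.json`, its schema, and the `propose` and `review` callables (stand-ins for calls into the Generator and Evaluator agents) are all assumptions.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical file name and schema; the source only says the contract lives in files on disk.
CONTRACT_PATH = Path("contract.json")


def negotiate_contract(
    propose: Callable[[list[str]], list[dict]],
    review: Callable[[list[dict]], list[str]],
    path: Path = CONTRACT_PATH,
    max_rounds: int = 5,
) -> dict:
    """Iterate proposal and review through a shared file until the reviewer has no objections."""
    objections: list[str] = []
    for round_no in range(1, max_rounds + 1):
        # Generator side: write the proposed criteria to disk.
        criteria = propose(objections)
        draft = {"round": round_no, "status": "proposed", "criteria": criteria}
        path.write_text(json.dumps(draft, indent=2))

        # Evaluator side: read the file back (never the Generator's context) and push back.
        draft = json.loads(path.read_text())
        objections = review(draft["criteria"])
        if not objections:
            draft["status"] = "agreed"
            path.write_text(json.dumps(draft, indent=2))
            return draft
    raise RuntimeError(f"no agreement after {max_rounds} rounds; last objections: {objections}")
```

The only channel between the two roles here is the file, which is what keeps their context windows separate. The agreed file then becomes the grading target for the sprint.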

Context Rot and Context Anxiety

As a session deepens, coherence degrades (context rot). Near the context limit, models rush to finish prematurely (context anxiety). Design the harness to manage context deliberately — via compaction, fresh sessions per feature, or structured hand-offs — depending on which model generation you are running.
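
One way to make that choice explicit is to record the observed behaviours of each model generation and derive the context strategy from them. The sketch below is illustrative only: `ModelProfile`, `ContextStrategy`, and the rule inside `choose_strategy` are hypothetical names and a simplification, and the flags are meant to come from reading your own traces.

```python
from dataclasses import dataclass
from enum import Enum


class ContextStrategy(Enum):
    COMPACT_IN_PLACE = "compact_in_place"
    FRESH_SESSION_PER_FEATURE = "fresh_session_per_feature"


@dataclass(frozen=True)
class ModelProfile:
    """Observed behaviours for one model generation (hypothetical structure)."""
    name: str
    shows_context_rot: bool
    shows_context_anxiety: bool


def choose_strategy(profile: ModelProfile) -> ContextStrategy:
    # Either failure mode is treated as a reason to reset between features;
    # file-based hand-offs then carry state across the reset.
    if profile.shows_context_rot or profile.shows_context_anxiety:
        return ContextStrategy.FRESH_SESSION_PER_FEATURE
    return ContextStrategy.COMPACT_IN_PLACE
```

Keeping the decision in one function makes it easy to revisit when a new model generation changes the profile.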

Structured Hand-offs and Clean Contexts

Lossy summaries drift. Instead of relying on compaction to preserve meaning across very long runs, use the file system as shared state. Persistent artifacts — progress files, feature lists as JSON (not markdown, which models overwrite), timestamped learnings logs — are more reliable than context-window memory.
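
As a rough sketch of what file-based shared state can look like, the helpers below write a state file atomically and keep an append-only, timestamped learnings log. The JSON Lines format, the file names, and the field names (`ts`, `feature`, `note`) are assumptions for illustration; the source specifies only that state lives in JSON files on disk.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_json_atomic(path: Path, data) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written state file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f, indent=2)
    os.replace(tmp, path)


def append_learning(log_path: Path, note: str, feature_id: str) -> dict:
    """Append one timestamped entry to a JSON Lines log; earlier entries are never rewritten."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "feature": feature_id,
        "note": note,
    }
    with log_path.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def read_learnings(log_path: Path) -> list[dict]:
    """Load every entry so a fresh session can orient itself without prior context."""
    if not log_path.exists():
        return []
    return [json.loads(line) for line in log_path.read_text().splitlines() if line.strip()]
```

An append-only log is one way to reduce the overwrite risk: a new session adds lines rather than regenerating the whole file.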

Harness and Model Co-Evolve

The harness does not disappear as models improve — it evolves. Identify the spiky failure modes of each model generation and fill those gaps with scaffolding. As the model improves and internalises a harness behaviour, simplify or remove that scaffold component. The frontier moves; your harness should track it.

Taste Is Gradable

Subjective quality — design taste, originality, aesthetic — is gradable if you have a strong enough opinion and write it down. Calibrate the Evaluator with few-shot reference examples of good and bad output. Vague criteria produce vague critiques; granular criteria produce actionable fixes.
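
Once the rubric is written down, per-dimension scores can be combined with explicit weights. The sketch below assumes the Evaluator returns a numeric score per dimension; the 0 to 10 scale and the specific weights are illustrative, not values from the source.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores; weights need not sum to 1."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in scores) / total_weight


# Hypothetical weights: heavier on the dimensions where the model is weakest.
WEIGHTS = {"Design": 3.0, "Originality": 3.0, "Craft": 2.0, "Functionality": 1.0}

# Hypothetical Evaluator scores on a 0-10 scale.
example = {"Design": 4.0, "Originality": 2.0, "Craft": 6.0, "Functionality": 9.0}
overall = weighted_score(example, WEIGHTS)  # (12 + 6 + 12 + 9) / 9
```

With these numbers the weighted result is 39/9, about 4.33, against an unweighted mean of 5.25. A strong Functionality score can no longer mask weak Design and Originality.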

High-Level Planning Only

The Planner should output a deliberately high-level spec broken into sprints — never granular technical decisions. Over-specification at the planning stage causes errors to cascade and magnify over a multi-hour time horizon. Let the Generator and Evaluator resolve the technical details through their contract.

Read the Traces

The primary debugging loop for any long-running harness is reading agent transcripts by hand, line by line, not running more experiments. Only by empathising with the model — understanding why it made each decision — can you know which scaffold components to delete, adjust, or keep as the frontier moves.
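
Hand-reading stays the primary loop, but a small script can build the reading list by flagging where the Evaluator's verdict disagrees with your own label. This is a sketch under assumptions: the verdict log is taken to be JSON Lines with the hypothetical fields `criterion`, `passed`, and `reason`, and `human_labels` is something you fill in yourself after inspecting the artifact.

```python
import json
from pathlib import Path


def find_divergences(verdicts_path: Path, human_labels: dict[str, bool]) -> list[dict]:
    """Return Evaluator verdicts that disagree with a human label for the same criterion."""
    divergences = []
    for line in verdicts_path.read_text().splitlines():
        if not line.strip():
            continue
        # Assumed record shape: {"criterion": str, "passed": bool, "reason": str}
        record = json.loads(line)
        expected = human_labels.get(record["criterion"])
        if expected is not None and expected != record["passed"]:
            divergences.append(record)
    return divergences
```

Each returned record points at a place in the transcript worth reading in full to understand why the model decided as it did.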

// How do you apply the Planner-Generator-Evaluator framework step by step?

  1. Initialise the Planner agent

    Feed the one-line user prompt to a Planner agent (use your most capable planning model, e.g. Opus-class). Its output must be: (a) a sprint-level feature list saved as featurelist.json — JSON, not markdown, because models are less likely to overwrite JSON files; (b) a progress file tracking feature completion state; (c) a Git repo initialised with an init script so subsequent sessions do not re-derive setup steps. Keep the spec high-level — no granular technical decisions. The Planner's job is to set hard outer lines for the product, not micromanage implementation.

  2. Assign roles and context windows

    Instantiate three separate agents, each with its own context window, system prompt, and single responsibility: Planner (done after Step 1), Generator (builder/IC), Evaluator (critic/QA). Never share raw Generator context with the Evaluator — this muddies the adversarial pressure. The Evaluator should only see the output artifact, not the Generator's reasoning trace.

  3. Run the Generator-Evaluator contract negotiation

    Before any code is written, the Generator proposes what it will build and how it should be verified. The Evaluator responds via a shared file on disk — pushing back on scope, weak tests, or missed edge cases. They iterate via file read/write until both agree. The resulting contract is the ground truth for this sprint — not the original Planner spec. Target ~20-30 granular contract criteria for meaningful, actionable grading. Vague criteria produce vague critiques.

  4. Execute the Generator build loop

    The Generator picks one feature (only one) that has not yet passed all tests, orients itself using the progress file and init script, builds the feature, and runs verification. Use programmatic tool calling where possible to reduce context consumption. The Generator should write timestamped learnings and state to a JSON log file throughout — these are breadcrumbs for future sessions or human handoff.

  5. Deploy the Evaluator with live verification tools

    The Evaluator must actively use the artifact: launch Playwright MCP (for web apps) or computer use (for native apps) to open live pages, click around, and stress-test features. It grades against the negotiated contract across your rubric dimensions (e.g. Design, Originality, Craft, Functionality). Weight the rubric toward the dimensions where the model is weakest; if the model already handles Functionality well, give that dimension less weight. Calibrate harshness with few-shot reference examples of good and bad output before deploying.

  6. Handle the Evaluator feedback loop

    The Evaluator writes its critique and score back to a shared file. If the Generator cannot hill-climb against the rubric after repeated attempts on a given criterion, the harness should discard the current attempt entirely and restart — not keep patching the same broken approach. This ability to course-correct over long time horizons is the core advantage over a RALF loop or single-session self-review.

  7. Update the progress file and loop

    If the feature passes all contract criteria, the Generator writes the Git commit and marks the feature as complete in featurelist.json. If unfinished features remain, the loop continues — either in the same session (with compaction, for capable models) or in a fresh context window (for models with severe context rot or anxiety). Choose based on the spiky behaviours of your current model generation.

  8. Read traces and tune the harness prompts

    After each run, read the full agent transcripts by hand. Find every point where the Evaluator's judgment diverged from yours. Treat this like reading a stack trace — empathise with why the model made each decision. Update the Evaluator's system prompt and rubric to close that gap. Optionally, pipe transcripts to a secondary agent to grep for patterns and suggest prompt updates. Only delete harness components when model improvements have rendered them redundant — run a simplified version and evaluate before committing.

  9. Adapt the harness to the current model generation

    Regularly reassess which scaffold components are still load-bearing. Examples: context-window resets between sessions may be critical for one model generation and unnecessary for the next; sprint decomposition may be essential for weaker planners but removable for stronger agentic models; Evaluator cadence may shift from every sprint to end-of-full-generation. A harness is only ever right for a specific model generation, so track model releases and strip components accordingly.
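
Steps 4 to 7 above can be sketched as a single control loop. This is a skeleton under stated assumptions, not a reference implementation: `build` and `evaluate` are hypothetical callables standing in for the Generator and Evaluator agents (each running in its own context window), and the `id` and `passes` fields are an assumed schema for featurelist.json.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def next_feature(features: list[dict]) -> Optional[dict]:
    """Pick exactly one feature that has not yet passed."""
    return next((f for f in features if not f["passes"]), None)


def run_harness(
    feature_file: Path,
    build: Callable[[dict, list[str]], str],
    evaluate: Callable[[dict, str], list[str]],
    max_attempts: int = 3,
) -> list[str]:
    """Build one feature at a time until every feature passes or one is abandoned."""
    log: list[str] = []
    while True:
        # Re-read state from disk on every iteration; the file, not memory, is the source of truth.
        features = json.loads(feature_file.read_text())
        feature = next_feature(features)
        if feature is None:
            return log

        critique: list[str] = []
        for attempt in range(1, max_attempts + 1):
            artifact = build(feature, critique)      # Generator: sees the critique, not the Evaluator's context
            critique = evaluate(feature, artifact)   # Evaluator: sees the artifact, not the Generator's reasoning
            log.append(f"{feature['id']} attempt {attempt}: {len(critique)} failing")
            if not critique:
                break
        else:
            # Hand control back so the caller can discard the attempt and restart.
            raise RuntimeError(f"{feature['id']} did not pass after {max_attempts} attempts")

        feature["passes"] = True
        feature_file.write_text(json.dumps(features, indent=2))
```

A real harness would also commit to Git after each passing feature and decide between compaction and a fresh session before the next iteration; both are omitted here to keep the sketch short.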

// What are real examples of the Planner-Generator-Evaluator framework in action?

A user provides the prompt 'build a full-stack project management tool' with no further specification.

The Planner receives the vague prompt and outputs a sprint-level featurelist.json covering major workflow areas (project creation, task assignment, progress tracking, notifications) without specifying technical stack details. The Generator and Evaluator negotiate a contract of ~25 criteria — e.g. 'drag-and-drop reordering must persist on reload; verify by dragging three tasks and refreshing'. The Generator builds one feature per loop iteration. The Evaluator launches Playwright, drags tasks, refreshes, and grades against the contract. When the Generator repeatedly fails to pass the 'real-time collaboration' criterion, the harness discards that attempt and restarts that sprint from scratch rather than patching. After 4-6 hours the progress file shows all features marked complete with passing tests.

A team wants to improve the visual design quality of AI-generated front-end UIs, which keep producing 'AI slop' aesthetics (purple gradients, generic layouts).

A rubric with four criteria is created: Design, Originality, Craft, Functionality — weighted heavily toward Design and Originality since the model already handles Functionality. The Evaluator is calibrated with few-shot reference screenshots labelled 'good design' and 'AI slop'. The Generator produces an HTML/CSS page; the Evaluator takes Playwright screenshots and scores across the rubric. If Originality scores consistently low across multiple rounds, the harness pivots — discards the current design direction entirely and restarts with a different generative seed — rather than iterating on the same failing aesthetic. After 5-15 rounds the output converges toward the rubric's defined taste.
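
The pivot decision in this example can be made mechanical with a plateau check on the score history. The sketch below is one possible rule, not the one used in the source: the window size, the minimum gain, and the function name `should_pivot` are all assumptions.

```python
def should_pivot(history: list[float], window: int = 3, min_gain: float = 0.5) -> bool:
    """True when the last `window` rounds beat the best earlier score by less than `min_gain`."""
    if len(history) <= window:
        return False  # not enough rounds to judge a plateau
    best_before = max(history[:-window])
    best_recent = max(history[-window:])
    return best_recent - best_before < min_gain
```

For example, an Originality history of `[3, 4, 4, 4.2, 4.1]` triggers a pivot with the default settings, while `[3, 4, 5, 6, 7]` does not, because the recent rounds are still improving.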

An agent harness was built and tuned for a previous model generation and the team upgrades to a newer, more agentic model.

The team identifies the spiky behaviours of the new model: context anxiety is gone (no need for forced session resets), the model can hold coherence across a 2-hour continuous build (sprint-by-sprint decomposition less critical), and it is willing to discard its own work and restart when the rubric is not met. The team runs a simplified harness — single continuous session with compaction, Evaluator running at end-of-full-generation rather than per sprint — and compares output quality and cost. Scaffold components rendered redundant by the model upgrade are removed. The harness remains structurally the same (Planner-Generator-Evaluator) but with fewer moving parts.

// What mistakes should I avoid when building a Planner-Generator-Evaluator harness?

  • Self-evaluation is a trap: never instruct the Generator to review its own output in the same context window. The sycophancy and generosity bias is just as present in coding tasks as in conversational ones — it will call a half-implemented feature done.
  • Sharing the Generator's reasoning trace with the Evaluator muddies adversarial pressure. The Evaluator should see only the output artifact, not how it was built; otherwise it can talk itself into believing something works on the strength of the Generator's narrative.
  • Vague rubric criteria produce vague critiques. If the Evaluator's grading language is imprecise, the Generator shrugs and makes arbitrary changes. Force yourself to write down granular, opinionated criteria — 20-30 contract items per sprint is a reasonable target.
  • Over-specifying in the Planner causes cascading errors. If the Planner tries to define granular technical decisions, any mistake at that stage magnifies across every subsequent sprint over a multi-hour horizon. Keep the Planner high-level and let the Generator-Evaluator contract handle specifics.
  • Using markdown files for persistent state is risky — models tend to overwrite them. Use JSON files for feature lists, progress tracking, and learnings logs.
  • Context rot and context anxiety are model-specific. Applying a fresh-session-per-feature approach designed for a model with severe context rot to a newer model with strong coherence adds unnecessary complexity and cost. Reassess after every major model release.
  • Compaction does not equal coherence. Lossy summaries drift over very long runs. Do not assume compaction alone is sufficient for state management — use the file system as the source of truth for shared state.
  • Running more experiments instead of reading traces is a false shortcut. The primary debugging loop is reading agent transcripts by hand, line by line, to understand where the model's judgment diverged from yours. Only then can prompt tuning be precise.
  • Treating the harness as permanent is a mistake. Components that are load-bearing for one model generation may be redundant for the next. Actively hunt for scaffold components to delete as model capabilities improve.
  • Out of the box, LLMs make bad QA agents — they will find a bug and defer it ('fix in 2 weeks') rather than blocking on it. Significant prompt tuning effort is required to make the Evaluator genuinely harsh. Plan for this calibration work upfront.
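
Prompt tuning is the fix the last point calls for, but a harness-side guard can complement it by refusing to pass a sprint while any criterion is failing, whatever the Evaluator's summary says. This guard is an assumption added for illustration and is not described in the source; `CriterionResult` and `sprint_verdict` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    criterion: str
    passed: bool
    note: str = ""


def sprint_verdict(results: list[CriterionResult], model_says_pass: bool) -> tuple[bool, list[str]]:
    """Pass only if every criterion passed; the model's own summary cannot override a failure."""
    if not results:
        return False, ["no criteria were graded"]
    blocking = [r.criterion for r in results if not r.passed]
    return (model_says_pass and not blocking), blocking
```

With this in place, a critique that marks a criterion as failed and then suggests deferring the fix still blocks the feature from being marked complete.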

// What do the key terms in the Planner-Generator-Evaluator framework mean?

Planner
The first agent in the three-role harness. Receives the vague user prompt and produces a high-level, sprint-level spec saved as persistent artifacts (featurelist.json, progress file, init script). Never plans granular technical details — its job is to set the hard outer lines of the product.
Generator
The builder/IC role in the harness. Operates in its own context window, picks one feature at a time from the feature list, implements it, and negotiates the definition-of-done contract with the Evaluator before writing any code.
Evaluator
The adversarial critic/QA role in the harness. Operates in its own context window with a separate, harshly-tuned system prompt. Uses live verification tools (e.g. Playwright MCP) to actively test the artifact — not just read diffs. Grades against the negotiated contract, not the original spec.
Generator-Evaluator Contract
A negotiated, file-based agreement between the Generator and Evaluator that defines exactly what 'done' means for a given sprint — specific features to build and specific tests that must pass. Written and revised via shared files on disk before any code is written. Replaces reliance on the Planner's spec for grading.
Context Rot
The degradation of coherence as an agent works deeper into a context window. Output becomes less consistent and on-track the further the session progresses without intervention.
Context Anxiety
A model behaviour where, as it approaches the end of its context window, it rushes to finish tasks prematurely and incompletely rather than managing the situation gracefully.
RALF Loop
A technique (more commonly written 'Ralph loop', originally from Geoffrey Huntley) of feeding a prompt into a coding agent CLI on a loop until all tasks are complete. Described as 'deterministically bad in an undeterministic world': better to fail predictably than succeed unpredictably. Has a fixed plan with no adversarial pressure from a separate evaluator.
Adversarial Pressure
The productive tension between the Generator and Evaluator — analogous to the relationship between generator and discriminator in a GAN. The Evaluator's independent, harsh grading forces the Generator to genuinely improve rather than self-certify.
Context Window Compaction
A mechanism (including server-side compaction) that summarises and compresses prior context to allow a session to continue beyond the raw context limit. Does not equal coherence — summaries are lossy and can drift over very long runs.
Persistent Artifacts
Files written to disk that maintain shared state across agent sessions and context windows: featurelist.json, progress files, init scripts, learnings logs, and timestamped decision records. The file system is the preferred shared state mechanism for long-running agents.
Progressive Disclosure
A context-efficiency technique where only the front matter of a skill or tool description is loaded into the context window initially; the full body is loaded only if that skill is instantiated. Reduces upfront context consumption.
Skills
Packaged tool descriptions, grading rubrics, or behavioural instructions that can be loaded into an agent's context using progressive disclosure. A useful primitive for encoding quality criteria into a reusable, composable form.
Spiky Behaviours
The specific, model-generation-level failure modes or weaknesses that a harness must compensate for. Identifying the current model's spiky behaviours and filling those gaps with scaffolding is the core discipline of harness design.
Hill Climbing
The iterative process by which the Generator improves its output across rubric dimensions in response to Evaluator critique. If the Generator cannot hill climb against a criterion after repeated attempts, the harness should discard and restart rather than continue patching.
AI Slop
The aesthetic failure mode of AI-generated front-end design: generic layouts, purple gradients, and visually undifferentiated output. Used as a calibration anti-example when tuning the Evaluator's design taste.

// FREQUENTLY ASKED QUESTIONS

What is the Planner-Generator-Evaluator framework for long-running AI agents?

It is a three-role multi-agent harness where a Planner decomposes a vague prompt into sprints, a Generator builds features one at a time, and an Evaluator adversarially grades output using live verification tools — each in separate context windows. The framework uses file-based contracts and persistent artifacts on disk to maintain coherence across hours-long sessions, preventing the context rot, context anxiety, and self-evaluation bias that cause single-agent loops to degrade.

What is adversarial evaluation in multi-agent systems?

Adversarial evaluation is the practice of using a separate, harshly-tuned critic agent — the Evaluator — to grade the Generator's output in its own context window. The Evaluator only sees the output artifact, never the Generator's reasoning trace. This mirrors the GAN dynamic: it's easier to tune a standalone critic to be harsh than to make a builder self-critical. The adversarial pressure prevents the sycophancy and generosity bias that causes self-reviewing agents to call half-baked work done.

How do I set up a Planner-Generator-Evaluator agent harness?

Start by feeding your one-line prompt to a Planner agent that outputs a sprint-level featurelist.json, a progress file, and an init script in a Git repo. Then instantiate three separate agents — Planner, Generator, Evaluator — each with its own context window and system prompt. Before coding, the Generator and Evaluator negotiate a contract of 20-30 granular criteria via shared files on disk. The Generator builds one feature per loop, and the Evaluator grades using live tools like Playwright MCP.

How do I write a quality rubric for the Evaluator agent?

Write opinionated criteria across 2-4 dimensions such as Design, Originality, Craft, and Functionality — weighting toward dimensions where the model is weakest. Each criterion must be specific enough to produce granular, actionable critique, not vague praise. Calibrate with few-shot reference examples showing what 'good' and 'bad' look like. Target 20-30 contract items per sprint. Vague criteria produce vague critiques that the Generator shrugs off; granular criteria drive real improvement.

How does the Planner-Generator-Evaluator framework compare to a RALF loop?

A RALF loop feeds a prompt into a coding agent CLI on repeat until tasks complete — it has a fixed plan with no adversarial pressure from a separate evaluator. The Planner-Generator-Evaluator framework adds three key advantages: sprint-level decomposition by a dedicated Planner, adversarial evaluation by a separate critic agent, and the ability to discard and restart failing approaches rather than endlessly patching. RALF is 'deterministically bad in an undeterministic world'; this framework enables genuine quality hill-climbing over multi-hour runs.

When should I use the Planner-Generator-Evaluator framework instead of a single-agent loop?

Use it whenever your agent task will run longer than roughly 30 minutes, involves building complex multi-feature applications, or requires subjective quality judgments like design taste. Single-agent loops suffer from self-evaluation bias, context rot, and context anxiety over long runs. If your agent is drifting from the original intent, stalling on features, or approving its own poor output, the three-role separation with adversarial evaluation is the structural fix.

What results can I expect from running a Planner-Generator-Evaluator harness?

Expect production-grade multi-feature applications built autonomously over 4-6 hour sessions, with each feature verified against a negotiated contract of 20-30 criteria. Quality converges toward your rubric's defined taste over 5-15 evaluation rounds per feature. The key outcome is coherent, high-quality output that a solo agent loop cannot achieve — particularly in subjective dimensions like design and originality where self-evaluation consistently fails.

What is context rot and how do I prevent it in long-running agents?

Context rot is the degradation of coherence as an agent works deeper into its context window — output becomes less consistent and on-track the further the session progresses. Prevent it by using the file system as shared state (featurelist.json, progress files, learnings logs), starting fresh context windows per feature when needed, and using structured hand-offs instead of relying on lossy compaction summaries. The specific mitigation depends on your model generation's spiky behaviours.

Why should the Evaluator never see the Generator's reasoning trace?

Sharing the Generator's reasoning trace with the Evaluator muddies the adversarial pressure that makes the framework work. If the Evaluator reads how the Generator approached a problem, the model can convince itself something is working based on the narrative rather than the actual artifact. The Evaluator should only see the output — the deployed app, the rendered page, the test results — and grade against the negotiated contract, preserving the GAN-like tension.

How do I debug a Planner-Generator-Evaluator harness that isn't working well?

Read the full agent transcripts by hand, line by line. This, not running more experiments, is the primary debugging loop. Find every point where the Evaluator's judgment diverged from yours, empathise with why the model made each decision, and update the Evaluator's system prompt and rubric to close that gap. Optionally pipe transcripts to a secondary agent to grep for patterns. Only delete harness components when you've confirmed model improvements have rendered them redundant.
