Frequently Asked Questions About the Anthropic Planner-Generator-Evaluator Long-Agent Framework

22 answers covering everything from basics to advanced usage.

// Basics

What is a Generator-Evaluator contract in multi-agent systems?

A Generator-Evaluator contract is a negotiated, file-based agreement that defines exactly what 'done' means for a given sprint — specific features to build and specific tests that must pass. Before any code is written, the Generator proposes what it will build and the Evaluator pushes back on scope, weak tests, or missed edge cases via shared files on disk. The resulting contract replaces the Planner's vague spec as the ground truth for grading.
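As a rough illustration, a contract file could be modelled like this. The field names (sprint, features, criteria, agreed_by) and the class names are assumptions made for this sketch; the framework description does not prescribe a schema.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Criterion:
    """One testable assertion plus the concrete step used to verify it."""
    id: str
    assertion: str
    verify: str


@dataclass
class SprintContract:
    """What 'done' means for one sprint (hypothetical schema)."""
    sprint: int
    features: list[str]
    criteria: list[Criterion] = field(default_factory=list)
    agreed_by: list[str] = field(default_factory=list)

    def is_final(self) -> bool:
        # The contract only becomes ground truth once both roles have signed off.
        return {"generator", "evaluator"} <= set(self.agreed_by)

    def to_json(self) -> str:
        # Serialised form that would be written to the shared location on disk.
        return json.dumps(asdict(self), indent=2)


contract = SprintContract(
    sprint=1,
    features=["drag-and-drop task reordering"],
    criteria=[
        Criterion(
            id="C1",
            assertion="drag-and-drop reordering must persist on reload",
            verify="drag three tasks, refresh, confirm the new order",
        )
    ],
    agreed_by=["generator"],
)
```

Here the contract is not yet final because only the Generator has signed; the Evaluator's sign-off is what would turn it into the grading reference.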

What is context anxiety in AI agents?

Context anxiety is a model behaviour where, as it approaches the end of its context window, it rushes to finish tasks prematurely and incompletely rather than managing the situation gracefully. In long-running agent sessions, this manifests as the agent cutting corners on the final features, declaring work done when it isn't, or skipping verification steps. The Planner-Generator-Evaluator framework mitigates this through fresh context windows per feature and file-based state management.

What are persistent artifacts and why do long-running agents need them?

Persistent artifacts are files written to disk that maintain shared state across agent sessions and context windows: featurelist.json, progress files, init scripts, learnings logs, and timestamped decision records. Long-running agents need them because lossy summaries from context-window compaction drift over time. The file system is more reliable than context-window memory — it doesn't degrade, can be read by fresh agent sessions, and provides a source of truth independent of any single context window.

Why use JSON instead of markdown for agent state files?

Models tend to overwrite markdown files — they treat markdown as a document to be rewritten rather than a data structure to be updated. JSON files for feature lists, progress tracking, and learnings logs are more durable because models are more likely to parse and update them incrementally rather than replacing them entirely. This is a practical finding from operating long-running agent harnesses where state preservation across sessions is critical.
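The kind of incremental update this favours might look like the sketch below. It assumes featurelist.json holds a list of objects with "id" and "status" keys, which is an illustrative layout rather than a documented one, and mark_feature is a hypothetical helper name.

```python
import json
import os
import tempfile
from pathlib import Path


def mark_feature(path: Path, feature_id: str, status: str) -> None:
    """Update one feature's status, leaving every other entry untouched."""
    features = json.loads(path.read_text())
    for feature in features:
        if feature["id"] == feature_id:
            feature["status"] = status
            break
    else:
        raise KeyError(f"unknown feature id: {feature_id}")

    # Write to a temp file in the same directory, then swap it in, so a crash
    # mid-write cannot leave a half-written state file behind.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(features, tmp, indent=2)
    os.replace(tmp_name, path)
```

Exposing a narrow helper like this as a tool, instead of letting the agent rewrite the file freely, is one way to keep updates incremental.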

// How To

How do I negotiate a Generator-Evaluator contract before coding starts?

The Generator proposes what it will build and how it should be verified by writing a contract file to disk. The Evaluator reads the file and pushes back on scope, weak tests, or missed edge cases by writing its response to the same shared location. They iterate via file read/write until both agree on 20-30 granular criteria. The contract should include specific, testable assertions — not vague goals. For example: 'drag-and-drop reordering must persist on reload; verify by dragging three tasks and refreshing.'
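A minimal sketch of that read/write loop follows. The propose and review callables stand in for the two agents, and the file names (contract_proposal.json, contract_review.json, contract.json) and the round limit are assumptions for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def negotiate(
    workdir: Path,
    propose: Callable[[list], dict],   # Generator: objections -> {"criteria": [...]}
    review: Callable[[dict], dict],    # Evaluator: proposal -> {"objections": [...]}
    max_rounds: int = 5,
) -> dict:
    """Alternate proposal and review files on disk until no objections remain."""
    objections: list = []
    for round_no in range(1, max_rounds + 1):
        # Generator writes its proposal, revised against the last objections.
        proposal = propose(objections)
        (workdir / "contract_proposal.json").write_text(json.dumps(proposal, indent=2))

        # Evaluator reads the proposal back from disk, not from shared memory.
        on_disk = json.loads((workdir / "contract_proposal.json").read_text())
        objections = review(on_disk)["objections"]
        (workdir / "contract_review.json").write_text(
            json.dumps({"round": round_no, "objections": objections}, indent=2)
        )

        if not objections:
            # Agreement reached: this file becomes the grading reference.
            (workdir / "contract.json").write_text(json.dumps(on_disk, indent=2))
            return on_disk
    raise RuntimeError(f"no agreement after {max_rounds} rounds")
```

Capping the rounds is a design choice of this sketch, so that a stalled negotiation surfaces as an error instead of looping indefinitely.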

How do I calibrate an Evaluator agent to match my design taste?

Provide few-shot reference examples of good and bad output in the Evaluator's system prompt. Label screenshots or code samples explicitly — 'this is good design' and 'this is AI slop.' Write granular rubric criteria weighted toward the subjective dimensions you care about most (Design, Originality) rather than dimensions the model already handles well (Functionality). After each run, read the Evaluator's transcripts, find where its judgment diverged from yours, and update the prompt to close that gap.
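One way to keep those labelled references and rubric criteria maintainable is to assemble the system prompt from data. The sketch below is illustrative only: the Example fields, the prompt wording, and build_evaluator_prompt are all assumptions, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass
class Example:
    label: str        # "good" or "slop"
    description: str  # what the reviewer sees, e.g. a screenshot caption
    critique: str     # why you, the human, judged it that way


def build_evaluator_prompt(rubric: dict[str, str], examples: list[Example]) -> str:
    """Assemble rubric criteria and labelled references into one system prompt."""
    lines = ["You are a strict design reviewer. Grade against this rubric:"]
    for dimension, criterion in rubric.items():
        lines.append(f"- {dimension}: {criterion}")
    lines.append("")
    lines.append("Reference examples, labelled by the product owner:")
    for ex in examples:
        verdict = "this is good design" if ex.label == "good" else "this is AI slop"
        lines.append(f"[{verdict}] {ex.description}")
        lines.append(f"  Why: {ex.critique}")
    return "\n".join(lines)
```

When a transcript shows the Evaluator's judgment diverging from yours, you would add or edit an Example entry and regenerate the prompt.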

How do I set up Playwright MCP for the Evaluator agent?

Deploy the Evaluator with Playwright MCP as a verification tool for web applications. The Evaluator should launch live pages, click through UI elements, submit forms, resize windows, and take screenshots — actively stress-testing the Generator's output rather than just reading diffs or code. For native apps, use computer use instead of Playwright. The key is that the Evaluator interacts with the actual running artifact, not a static representation of it, to catch real user-facing bugs.

How do I decide whether to use compaction or fresh context windows between features?

It depends on your model generation's spiky behaviours, meaning the uneven, model-specific weaknesses that show up in long sessions. Models with severe context rot or context anxiety benefit from fresh context windows per feature: the Generator starts clean with just the init script, progress file, and contract. Models with strong coherence over long windows can use compaction within a continuous session. Test both approaches and compare output quality. The wrong choice either wastes money (unnecessary resets for capable models) or degrades output (compaction for models that can't handle it).
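For the fresh-context option, the opening prompt of each Generator session can be rebuilt purely from disk. This is a sketch under assumptions: the file names in STATE_FILES are hypothetical, since only the three inputs themselves (init script, progress file, contract) are named here.

```python
from pathlib import Path

# Hypothetical file names for the init script, progress file, and contract.
STATE_FILES = ["init.sh", "progress.json", "contract.json"]


def build_fresh_context(workdir: Path, feature: str) -> str:
    """Build the opening prompt for a clean Generator session from disk state only."""
    sections = [f"Current feature: {feature}"]
    for name in STATE_FILES:
        path = workdir / name
        if not path.exists():
            # Fail loudly: a session started without full state will drift.
            raise FileNotFoundError(f"missing state file: {name}")
        sections.append(f"--- {name} ---\n{path.read_text().strip()}")
    return "\n\n".join(sections)
```

Nothing from the previous session's conversation is carried over, so whatever the next session needs has to be in those files.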

// Troubleshooting

My Evaluator keeps approving mediocre output — how do I make it harsher?

Out of the box, LLMs make bad QA agents — they find bugs and defer them rather than blocking on them. Fix this with significant prompt tuning: add explicit instructions to block on any failing criterion, provide few-shot examples of harsh critiques, and include anti-examples of the 'find a bug but defer it' pattern with instructions to never do this. Weight your rubric toward the dimensions where you're seeing the most leniency. Plan for multiple rounds of calibration — this is expected, not exceptional.
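Prompt tuning can be backed up on the harness side by making "deferred" an impossible outcome. The sketch below is an assumption about how you might structure the Evaluator's output, not part of the framework description; Finding and verdict are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    criterion_id: str
    passed: bool
    note: str = ""


def verdict(findings: list[Finding]) -> dict:
    """Approve only if every criterion was graded and passed."""
    failures = [f for f in findings if not f.passed]
    return {
        # An empty findings list is not an approval: nothing was verified.
        "approved": bool(findings) and not failures,
        # Every failure is returned as blocking. There is deliberately no
        # "defer" or "minor" bucket for the Evaluator to park bugs in.
        "blocking": [f"{f.criterion_id}: {f.note}" for f in failures],
    }
```

If the Evaluator's structured output only allows pass or fail per criterion, a found-but-deferred bug has nowhere to go except the blocking list.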

My Generator keeps patching the same broken approach instead of restarting — what's wrong?

Your harness lacks the discard-and-restart mechanism that is core to the framework. If the Generator cannot hill-climb against a rubric criterion after repeated attempts, the harness must discard the current attempt entirely and restart from scratch — not keep patching. Implement a maximum retry count per criterion, and when exceeded, wipe the current feature branch and begin a fresh attempt. This ability to course-correct over long time horizons is the core advantage over single-agent loops.

My Planner is over-specifying technical decisions and causing cascading errors — how do I fix this?

Constrain the Planner's system prompt to output only sprint-level feature lists with no granular technical decisions. The Planner's job is to set hard outer lines for the product — the what, not the how. Technical specifics should be resolved during the Generator-Evaluator contract negotiation where mistakes can be caught and corrected before code is written. If the Planner specifies a wrong architectural choice, that error magnifies across every subsequent sprint over a multi-hour horizon.

My long-running agent's output is drifting from the original intent — what's happening?

This is likely context rot — coherence degradation as the agent works deeper into its context window. Fix it by using the file system as shared state instead of relying on context-window memory. Ensure the progress file, featurelist.json, and contract are read at the start of each loop iteration. Consider switching to fresh context windows per feature if your model shows severe context rot. Also check whether the Evaluator is grading against the negotiated contract — if it's grading against a drifted interpretation of the spec, the Generator will drift too.

// Comparisons

How does the Planner-Generator-Evaluator framework compare to AutoGPT or BabyAGI?

AutoGPT and BabyAGI use a single agent with self-directed task decomposition and self-evaluation. The Planner-Generator-Evaluator framework fundamentally differs by separating roles into distinct agents with separate context windows and adding adversarial evaluation. AutoGPT-style agents suffer from the self-evaluation trap — the same model that built something rates it favorably. The PGE framework also uses file-based contracts and persistent artifacts instead of relying on in-context memory, making it suitable for genuinely long runs where those architectures collapse.

How is the Planner-Generator-Evaluator framework different from just using LLM-as-judge?

A plain LLM-as-judge setup often evaluates output in the same context as, or with the same model as, the agent that produced it, and so inherits sycophancy and generosity bias. The PGE framework makes the Evaluator a structurally separate agent with its own context window, system prompt, and harshly tuned rubric; it never sees the Generator's reasoning trace. It also uses live verification tools like Playwright to actively test artifacts, not just read outputs. The framework exploits the insight that tuning a critic to be harsh is tractable; tuning a builder to be self-critical is not.

Can I use the Planner-Generator-Evaluator framework with open-source models?

Yes, but you must identify the spiky behaviours of your specific model and adjust the harness accordingly. Open-source models may have smaller context windows (requiring more frequent fresh sessions), weaker planning capabilities (requiring more detailed Planner prompts), or less robust tool-calling (requiring simpler verification approaches). The framework's core principle — harness and model co-evolve — means you should add more scaffolding for weaker models and remove it as you upgrade. The structural pattern of separate Planner, Generator, and Evaluator roles applies regardless of model provider.

// Advanced

How does progressive disclosure work in the Planner-Generator-Evaluator framework?

Progressive disclosure is a context-efficiency technique where only the front matter of a skill or tool description is loaded into the context window initially — a brief summary of what the skill does and when to use it. The full body is loaded only if that skill is actually instantiated for the current task. This reduces upfront context consumption significantly, which is critical for long-running agents where every token of context budget matters for maintaining coherence.
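A small sketch of the idea, assuming each skill lives in its own file with a header delimited by '---' lines containing simple "key: value" pairs. That file layout and the function names are assumptions for illustration.

```python
from pathlib import Path


def read_front_matter(path: Path) -> dict[str, str]:
    """Read only the header between the first two '---' lines; stop before the body."""
    meta: dict[str, str] = {}
    with path.open() as fh:
        if fh.readline().strip() != "---":
            return meta
        for line in fh:
            if line.strip() == "---":
                break
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta


def skill_index(skill_dir: Path) -> str:
    """One summary line per skill: this is all that enters the context up front."""
    rows = []
    for path in sorted(skill_dir.glob("*.md")):
        meta = read_front_matter(path)
        rows.append(f"{meta.get('name', path.stem)}: {meta.get('description', '')}")
    return "\n".join(rows)


def load_skill_body(path: Path) -> str:
    """Full instructions, loaded only once the agent actually selects this skill."""
    text = path.read_text()
    parts = text.split("---", 2)
    return parts[2].strip() if text.startswith("---") and len(parts) == 3 else text
```

The index costs one line per skill regardless of how long each skill's body is, which is where the context saving comes from.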

How do I know when to remove a scaffold component from my agent harness?

Regularly reassess which components are load-bearing by running a simplified version of your harness and comparing output quality. For example, if a newer model no longer exhibits context anxiety, test removing the forced session resets between features. If the model can hold coherence across a 2-hour continuous build, test reducing sprint-level decomposition. Only commit to removing a component after you've confirmed the model upgrade has rendered it redundant. Read traces from the simplified run to verify — don't just check final output quality.

What is the GAN analogy in the Planner-Generator-Evaluator framework?

The framework borrows the core insight from Generative Adversarial Networks: productive tension between a generator and a discriminator. The Generator produces artifacts and the Evaluator harshly grades them, creating adversarial pressure that forces genuine improvement. The framework also exploits a key asymmetry between the two roles: it is far easier to critique a meal than to cook one. Tuning the Evaluator to be harsh is tractable; tuning the Generator to be self-critical is not. Separate context windows preserve this adversarial dynamic.

How do I handle features the Generator repeatedly fails to build?

If the Generator cannot hill-climb against a rubric criterion after repeated attempts, the harness should discard the current attempt entirely and restart that sprint from scratch — not keep patching. This is a core mechanism: the ability to course-correct by abandoning failing approaches is what separates the PGE framework from endless-patching loops. Set a maximum retry threshold, and upon exceeding it, wipe the feature branch, optionally vary the generative seed, and begin fresh while preserving learnings in the JSON log file.
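The retry threshold and restart can be sketched as a small harness loop. Here attempt and wipe are hypothetical callables standing in for "run one build-and-grade cycle" and "discard the feature branch", and the learnings log layout is an assumption.

```python
import json
from pathlib import Path
from typing import Callable


def build_with_restarts(
    attempt: Callable[[int], bool],   # one build-and-grade cycle for a given seed
    wipe: Callable[[], None],         # discards the feature branch / working tree
    learnings_log: Path,
    max_patches: int = 3,
    max_restarts: int = 2,
) -> bool:
    """Patch up to max_patches times, then discard everything and start over."""
    for restart in range(max_restarts + 1):
        seed = restart  # vary the seed so a restart does not replay the same path
        for _ in range(max_patches):
            if attempt(seed):
                return True
        # Hill-climbing stalled: record what happened, then throw the work away.
        entries = json.loads(learnings_log.read_text()) if learnings_log.exists() else []
        entries.append({"restart": restart, "seed": seed, "outcome": "discarded"})
        learnings_log.write_text(json.dumps(entries, indent=2))
        wipe()
    return False
```

The learnings file is written before the wipe and lives outside the discarded work, so the next attempt starts clean but not ignorant.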

How many contract criteria should I target per sprint?

Target 20-30 granular contract criteria per sprint for meaningful, actionable grading. Fewer than 20 tends to produce evaluations too vague for the Generator to act on; more than 30 can exceed what a single sprint can reasonably cover. Each criterion should be specific and testable: 'drag-and-drop reordering must persist on reload; verify by dragging three tasks and refreshing' rather than 'drag-and-drop should work.' The contract is the ground truth for grading, not the original Planner spec.
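A harness could run a quick sanity check on a contract before accepting it. The sketch below assumes each criterion is a dict with "assertion" and "verify" keys, which is an illustrative layout; check_contract is a hypothetical helper.

```python
def check_contract(criteria: list[dict], low: int = 20, high: int = 30) -> list[str]:
    """Return a list of problems; an empty list means the contract is gradeable."""
    problems = []
    if not low <= len(criteria) <= high:
        problems.append(f"expected {low}-{high} criteria, got {len(criteria)}")
    for i, criterion in enumerate(criteria, start=1):
        # A criterion without a verification step is a goal, not a test.
        if not criterion.get("verify", "").strip():
            problems.append(f"criterion {i} has no verification step")
    return problems
```

A check like this catches only structural problems; whether a criterion is actually meaningful still depends on the Evaluator's pushback during negotiation.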

Can I pipe agent transcripts to another AI to help debug the harness?

Yes. After you've read transcripts yourself, you can optionally pipe them to a secondary agent to grep for patterns and suggest prompt updates. However, this is a supplement to manual reading, not a replacement. The primary debugging loop must involve you reading transcripts by hand, line by line, to understand why the model made each decision. Only then can you know which scaffold components to delete, adjust, or keep. The secondary agent helps with scale, not insight.

What rubric dimensions should I weight most heavily?

Weight toward the dimensions where your model is weakest, not the ones it already handles well. Most current models handle Functionality reasonably — code that runs, APIs that respond. They struggle more with Design (avoiding AI slop aesthetics), Originality (producing differentiated output), and Craft (polished details). If your model already passes functionality tests but produces generic-looking UIs, weight Design and Originality heavily and give Functionality minimal weight in the Evaluator's scoring.
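To see what the weighting does to a score, here is a small worked sketch. The weights and the 0-10 scale are illustrative assumptions, not recommended values.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-dimension scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


# Illustrative weights only: heavy on Design and Originality, light on Functionality.
WEIGHTS = {"Design": 0.4, "Originality": 0.3, "Craft": 0.2, "Functionality": 0.1}

# A build that works perfectly but looks generic (scores on a 0-10 scale).
generic_but_working = {"Design": 3, "Originality": 2, "Craft": 5, "Functionality": 10}
```

With these numbers the build scores about 3.8, against 5.0 under equal weights, so perfect functionality can no longer carry a generic-looking result past the Evaluator.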