Question 1

What is adversarial pressure in a multi-agent system?

Accepted Answer

Adversarial pressure is the productive tension between the Generator and Evaluator, analogous to the generator-discriminator relationship in GANs. The Evaluator's independent, harsh grading forces the Generator to genuinely improve rather than self-certify. This pressure only works when the Evaluator operates in a separate context window and sees only the output artifact — never the Generator's reasoning trace, which would muddy its objectivity.
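
The isolation rule can be made concrete with a small sketch. The `GeneratorResult` type and `evaluator_payload` function are hypothetical names chosen for illustration; the only point taken from the answer above is that the reasoning trace never reaches the Evaluator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratorResult:
    artifact_path: str    # what was built (hypothetical field name)
    reasoning_trace: str  # the Generator's private working notes


def evaluator_payload(result: GeneratorResult, contract: list[str]) -> dict:
    """Return only what the Evaluator is allowed to see.

    The reasoning trace is deliberately dropped so the Evaluator judges
    the artifact itself, not the Generator's account of it.
    """
    return {"artifact_path": result.artifact_path, "contract": list(contract)}
```

Building the payload in one place makes it easy to audit that nothing from the Generator's context leaks across.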

Question 2

What is the difference between context rot and context anxiety?

Accepted Answer

Context rot is the gradual degradation of coherence as an agent works deeper into its context window — outputs drift and become inconsistent. Context anxiety is a distinct behavior where, as the model approaches its context limit, it rushes to finish tasks prematurely and incompletely. Both are model-generation-specific: some models suffer from severe context rot but no anxiety, while others exhibit the opposite. Your harness design should address whichever spiky behavior your current model exhibits.

Question 3

Can I use the Planner-Generator-Evaluator framework with open-source models?

Accepted Answer

Yes, but you must identify the spiky behaviors of your specific model and adjust the harness accordingly. Open-source models may exhibit more severe context rot, weaker planning ability, or less reliable tool use — each requiring additional scaffolding. You might need more frequent context window resets, stricter contract criteria, or a stronger commercial model in the Evaluator role while using the open-source model as the Generator. The framework's architecture is model-agnostic; the tuning is model-specific.

Question 4

How do I choose which model to use for each role in the harness?

Accepted Answer

Use your most capable planning model (Opus-class) for the Planner since it runs once and sets the project direction. The Generator needs strong coding and tool-use capabilities with good context coherence. The Evaluator needs strong judgment and the ability to use verification tools like Playwright MCP. You can mix models across roles — for example, a stronger model as Evaluator and a faster, cheaper model as Generator — since their context windows are separate.

Question 5

How do I write a Generator-Evaluator contract that actually works?

Accepted Answer

Start by having the Generator propose what it will build and specific verification steps. The Evaluator pushes back via shared files on disk — questioning scope, identifying weak tests, and surfacing edge cases. Target 20–30 granular criteria per sprint. Each criterion should be testable: 'drag-and-drop reordering must persist on page reload' is good; 'the UI should be nice' is too vague. Iterate the contract through file read/write until both agents agree before any code is written.
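
A negotiated contract could be sanity-checked before signing with a sketch like the following. The field names (`id`, `statement`, `verify`) and the vague-word list are assumptions made for illustration, not something the framework prescribes; the 20 to 30 range comes from the answer above.

```python
from dataclasses import dataclass

# Hypothetical list of words that usually signal an untestable criterion.
VAGUE_WORDS = {"nice", "good", "clean", "intuitive", "polished"}


@dataclass
class Criterion:
    id: str
    statement: str  # what must be true
    verify: str     # concrete steps the Evaluator will perform


def contract_problems(criteria: list[Criterion], lo: int = 20, hi: int = 30) -> list[str]:
    """Return reasons the contract is not ready to sign; an empty list means ready."""
    problems = []
    if not lo <= len(criteria) <= hi:
        problems.append(f"expected {lo}-{hi} criteria, got {len(criteria)}")
    for c in criteria:
        if not c.verify.strip():
            problems.append(f"{c.id}: no verification steps")
        words = {w.strip(".,;:'\"").lower() for w in c.statement.split()}
        if words & VAGUE_WORDS:
            problems.append(f"{c.id}: vague wording")
    return problems
```

A word list is a crude filter; in practice the Evaluator's pushback does most of this work, and a check like this only catches the obvious cases.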

Question 6

How do I calibrate the Evaluator to match my taste?

Accepted Answer

Provide few-shot reference examples of good and bad output in the Evaluator's system prompt. For design quality, include screenshots labeled 'good design' versus 'AI slop.' For code quality, show examples of clean versus sloppy implementations. Write your rubric with granular, opinionated criteria across 2–4 dimensions, weighted toward dimensions where the model is weakest. After each run, read the Evaluator's transcripts and adjust its prompt wherever its judgment diverged from yours.
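
As a rough illustration of a weighted rubric, the sketch below combines per-dimension scores into one number. The dimension names and weights are invented for the example; in practice you would choose them to lean on whatever your model does worst.

```python
# Hypothetical rubric: dimension -> integer weight. Heavier weights go to
# the dimensions where the model is assumed to be weakest.
RUBRIC_WEIGHTS = {"design": 4, "originality": 3, "functionality": 2, "craft": 1}


def weighted_score(scores: dict[str, float],
                   weights: dict[str, int] = RUBRIC_WEIGHTS) -> float:
    """Weighted average of per-dimension scores; every dimension must be scored."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight
```

Refusing to score when a dimension is missing keeps the Evaluator from quietly skipping the dimension it finds hardest to judge.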

Question 7

How do I set up Playwright MCP for the Evaluator to verify web apps?

Accepted Answer

Configure Playwright MCP as a tool available to the Evaluator agent. The Evaluator should be instructed to actively launch live pages, click around, fill forms, and stress-test features — not just read code diffs. Structure the contract criteria so each one maps to a verifiable action the Evaluator can perform via Playwright: loading a page, interacting with elements, checking persistence after refresh. For native apps, use computer use tools instead of Playwright MCP.

Question 8

Why does my AI agent keep calling half-finished features done?

Accepted Answer

This is the self-evaluation trap. Models exhibit sycophancy and generosity bias when judging their own output, the same bias present in LLM-as-judge systems. The fix is architectural: never let the Generator evaluate its own work. Deploy a separate Evaluator in its own context window with a harshly tuned system prompt. The Evaluator should use live verification tools to test the artifact, not just read the Generator's claim that it works. Significant prompt tuning is required to make the Evaluator genuinely harsh: out of the box, LLMs tend to defer bugs rather than block on them.

Question 9

My long-running agent keeps losing track of what it's building — how do I fix that?

Accepted Answer

This is context rot. Stop relying on the context window as memory. Instead, use the file system as shared state: featurelist.json for features, a progress file for completion state, timestamped learnings logs for decisions, and an init script so new sessions don't re-derive setup. Use JSON files rather than markdown — models tend to overwrite markdown. For models with severe context rot, start fresh context windows per feature and orient the agent using persistent artifacts at session start.
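
A session-start orientation step might look like the sketch below. It assumes featurelist.json holds a list of objects with an `id` field and that progress lives in a file named progress.json mapping feature ids to a status; both the layout and the progress file name are assumptions for illustration.

```python
import json
from pathlib import Path


def orient(workdir: Path) -> dict:
    """Build a session-start briefing from files on disk, not from chat history."""
    features = json.loads((workdir / "featurelist.json").read_text())
    progress_path = workdir / "progress.json"  # hypothetical file name
    progress = json.loads(progress_path.read_text()) if progress_path.exists() else {}
    # A feature counts as remaining until the progress file marks it "done".
    remaining = [f for f in features if progress.get(f["id"]) != "done"]
    return {
        "next_feature": remaining[0] if remaining else None,
        "completed": len(features) - len(remaining),
        "total": len(features),
    }
```

Feeding this briefing into a fresh context window gives the agent its bearings without replaying earlier sessions.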

Question 10

What should I do when the Generator can't improve on a failing criterion after multiple attempts?

Accepted Answer

Discard the current attempt entirely and restart from scratch rather than continuing to patch the same broken approach. This is a core design principle of the framework: the ability to course-correct over long time horizons is the main advantage over single-session self-review. The harness should detect when hill climbing has stalled on a specific criterion and trigger a fresh attempt with a different generative approach rather than accumulating patches on a fundamentally flawed implementation.
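
One way a harness could detect a stall is to count consecutive failures per criterion, as in this sketch. The `StallDetector` class and its threshold of three rounds are assumptions for illustration; the source does not specify how stalls are measured.

```python
from collections import defaultdict


class StallDetector:
    """Tracks consecutive failed evaluation rounds for each contract criterion."""

    def __init__(self, max_failures: int = 3):  # threshold is an assumption
        self.max_failures = max_failures
        self._streak = defaultdict(int)

    def record(self, results: dict[str, bool]) -> list[str]:
        """Record one evaluation round; return criteria that have stalled."""
        stalled = []
        for criterion_id, passed in results.items():
            # A pass resets the streak; a failure extends it.
            self._streak[criterion_id] = 0 if passed else self._streak[criterion_id] + 1
            if self._streak[criterion_id] >= self.max_failures:
                stalled.append(criterion_id)
        return stalled
```

When `record` returns a non-empty list, the harness would discard the current attempt and start a fresh one instead of requesting another patch.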

Question 11

My Evaluator keeps finding bugs but deferring them instead of blocking — how do I fix that?

Accepted Answer

This is a known default behavior: out of the box, LLMs make bad QA agents and will find a bug and suggest 'fix in 2 weeks' rather than blocking. You need significant prompt tuning effort upfront. Explicitly instruct the Evaluator that any failing criterion is a hard block — no deferral, no 'nice to have' language. Calibrate with few-shot examples showing a bug being found and the correct response (block and require fix) versus the wrong response (defer). Read Evaluator transcripts after each run to catch and correct any remaining deferral patterns.
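
The hard-block rule can also be enforced outside the prompt. In the sketch below, the verdict is computed from the Evaluator's per-criterion findings, so there is simply no "defer" outcome to choose. `Finding` and `verdict` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    criterion_id: str
    passed: bool
    note: str = ""


def verdict(findings: list[Finding]) -> dict:
    """Any failing criterion blocks the sprint; there is no 'defer' status."""
    failures = [f.criterion_id for f in findings if not f.passed]
    return {"status": "BLOCK" if failures else "PASS", "must_fix": failures}
```

Deriving the status in code means a note like "fix later" in a finding cannot soften the result.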

Question 12

How does the Planner-Generator-Evaluator framework compare to AutoGPT or similar autonomous agent frameworks?

Accepted Answer

AutoGPT and similar frameworks typically use a single agent loop with self-reflection, which falls prey to sycophancy bias — the agent certifies its own work as complete. The Planner-Generator-Evaluator framework enforces role separation with independent context windows and adversarial evaluation using live verification tools. It also uses file-system-based state management rather than relying on context-window memory, and it builds in the ability to discard and restart failing approaches rather than patching indefinitely.

Question 13

How is the Planner-Generator-Evaluator framework different from just using a code review agent?

Accepted Answer

A code review agent typically reads diffs in the same context as the builder, or in a context shared with it, which weakens adversarial pressure. The Evaluator in this framework operates in a completely separate context window, sees only the output artifact (never the Generator's reasoning trace), and uses live verification tools like Playwright to actively test features rather than just review code. Additionally, the contract negotiation phase ensures the Evaluator grades against agreed criteria, not subjective impressions formed during code reading.

Question 14

Is the Planner-Generator-Evaluator framework inspired by GANs?

Accepted Answer

Yes, the Generator-Evaluator dynamic is explicitly inspired by GANs. The key insight is the asymmetry: tuning a standalone critic to be harsh is tractable, but tuning a builder to be self-critical is not. It is far easier to critique a meal than to cook one. The framework exploits this asymmetry by giving each role its own context window, system prompt, and job — creating productive adversarial tension that drives genuine quality improvement rather than self-certification.

Question 15

How do I adapt my agent harness when a new model generation is released?

Accepted Answer

Identify the spiky behaviors of the new model — which failure modes have been resolved and which remain. Run a simplified version of your harness (e.g., single continuous session instead of per-feature resets) and compare output quality and cost. Remove scaffold components that are no longer load-bearing: if context anxiety is gone, remove forced session resets; if coherence holds over 2 hours, reduce sprint decomposition granularity. The harness structure (Planner-Generator-Evaluator) stays the same, but the specific scaffolding evolves.
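
The comparison between the full and simplified harness can be recorded with a small helper such as the one below. The `RunResult` fields and the 0.02 quality tolerance are assumptions for illustration; how you measure quality (for example, the fraction of contract criteria passed) is up to you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    quality: float   # e.g. fraction of contract criteria passed, 0.0 to 1.0
    cost_usd: float


def ablation_report(full: RunResult, simplified: RunResult,
                    max_quality_drop: float = 0.02) -> dict:
    """Compare a run with a scaffold component against a run without it."""
    drop = full.quality - simplified.quality
    return {
        "quality_drop": drop,
        "cost_saved_usd": full.cost_usd - simplified.cost_usd,
        # Keep the component only if removing it costs noticeable quality.
        "keep_component": drop > max_quality_drop,
    }
```

A single pair of runs is noisy, so a decision to remove a component is better based on several comparisons than on one.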

Question 16

Should I use JSON or markdown for persistent state files in a long-running agent?

Accepted Answer

Use JSON files for feature lists, progress tracking, and learnings logs. Models tend to overwrite markdown files during operation, destroying accumulated state. JSON's structured format makes models less likely to casually rewrite the entire file. This is especially important for featurelist.json and progress files that must persist accurately across multiple agent sessions and context window resets over a multi-hour run.
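
A harness can further protect these files by giving the agent an append-only helper instead of raw write access. The sketch below assumes the learnings log is a JSON array of `{"at", "note"}` objects; that layout, and the write-then-rename step, are illustrative choices rather than anything the framework specifies.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_learning(path: Path, note: str) -> None:
    """Append one timestamped entry, leaving earlier entries untouched."""
    entries = json.loads(path.read_text()) if path.exists() else []
    entries.append({"at": datetime.now(timezone.utc).isoformat(), "note": note})
    # Write to a temp file in the same directory, then rename over the
    # original, so a crash mid-write cannot leave a truncated log.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(entries, f, indent=2)
    os.replace(tmp, path)
```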

Question 17

How many contract criteria should I aim for per sprint?

Accepted Answer

Target 20–30 granular contract criteria per sprint for meaningful, actionable grading. With fewer than 20, individual criteria tend to be too broad, producing generic critiques the Generator can't act on. Each criterion should be testable and specific: for example, 'drag-and-drop reordering must persist on page reload; verify by dragging three tasks and refreshing' rather than 'drag-and-drop should work.' The contract is negotiated between the Generator and Evaluator before any code is written.

Question 18

What is progressive disclosure in the context of AI agent skills?

Accepted Answer

Progressive disclosure is a context-efficiency technique where only the front matter of a skill or tool description is loaded into the context window initially; the full body is loaded only if that skill is instantiated. This reduces upfront context consumption, which is critical for long-running agents where every token of context matters. It allows the agent to have access to many skills without paying the full context cost until a specific skill is actually needed.
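
A minimal sketch of the idea, assuming each skill is a `.md` file whose front matter sits between two `---` lines at the top. That file layout and the `SkillIndex` class are assumptions for illustration, not a description of any particular product's skill format.

```python
from pathlib import Path


def read_front_matter(path: Path) -> str:
    """Return the text between the first two '---' lines, reading no further."""
    lines = []
    with path.open() as f:
        if f.readline().strip() != "---":
            return ""  # no front matter present
        for line in f:
            if line.strip() == "---":
                break
            lines.append(line.rstrip("\n"))
    return "\n".join(lines)


class SkillIndex:
    def __init__(self, skill_dir: Path):
        self.paths = {p.stem: p for p in sorted(skill_dir.glob("*.md"))}
        # Only the short summaries are loaded up front.
        self.summaries = {name: read_front_matter(p) for name, p in self.paths.items()}

    def load_body(self, name: str) -> str:
        """Load the full skill text only when the skill is actually invoked."""
        return self.paths[name].read_text()
```

Only `summaries` would go into the initial prompt; `load_body` is called at the moment the agent picks a skill.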

Question 19

Can the Planner-Generator-Evaluator framework be used for non-coding tasks?

Accepted Answer

Yes, the framework applies to any complex creative or analytical task that benefits from adversarial evaluation over extended sessions. The core architecture — high-level planning, iterative generation, and independent evaluation against a negotiated contract — works for content generation, data pipeline construction, research synthesis, or any domain where quality is subjective and self-evaluation is unreliable. The verification tools change (e.g., Playwright for web apps becomes domain-specific validators), but the three-role structure and file-based state management remain the same.

Question 20

What does 'reading the traces' mean and why is it the most important debugging step?

Accepted Answer

Reading the traces means going through the full agent transcripts by hand, line by line, after each run. You're looking for every point where the Evaluator's judgment diverged from yours and understanding why the model made each decision. This is like reading a stack trace — you empathize with the model's reasoning to determine which scaffold components to delete, adjust, or keep. Running more experiments without reading traces is a false shortcut because prompt tuning cannot be precise without understanding the specific failure points.

Question 21

How do I prevent the Planner from over-specifying and causing cascading errors?

Accepted Answer

Instruct the Planner to output only sprint-level feature lists with high-level descriptions — never granular technical decisions like framework choices, database schemas, or API designs. Save the output as featurelist.json with each feature described in one or two sentences. Any mistake in granular technical specification at the planning stage magnifies across every subsequent sprint over a multi-hour horizon. Let the Generator and Evaluator resolve technical details through their contract negotiation.
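
A harness could lint the Planner's output before any sprint starts. The sketch below flags features whose description runs past two sentences or mentions technical decisions; the `id` and `description` fields and the term list are assumptions for illustration, and the check is deliberately crude.

```python
import json

# Hypothetical terms suggesting the Planner is making technical decisions.
TECH_TERMS = ("schema", "endpoint", "postgres", "react", "rest api", "graphql")


def overspecified(featurelist_json: str, max_sentences: int = 2) -> list[str]:
    """Return ids of features that look too detailed for the planning stage."""
    flagged = []
    for feature in json.loads(featurelist_json):
        text = feature["description"]
        # Rough sentence split: treat '.', '!' and '?' as sentence ends.
        normalized = text.replace("!", ".").replace("?", ".")
        sentences = [s for s in normalized.split(".") if s.strip()]
        too_long = len(sentences) > max_sentences
        too_technical = any(term in text.lower() for term in TECH_TERMS)
        if too_long or too_technical:
            flagged.append(feature["id"])
    return flagged
```

Flagged features would be sent back to the Planner for a higher-level rewrite rather than edited by hand.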

Question 22

What are spiky behaviors in AI models and why do they matter for harness design?

Accepted Answer

Spiky behaviors are the specific, model-generation-level failure modes or weaknesses that a harness must compensate for. Examples include context anxiety (rushing to finish near the context limit), context rot (coherence degrading over long sessions), sycophancy bias (self-certifying poor work), or reluctance to discard and restart. Identifying your current model's spiky behaviors is the core discipline of harness design — each spike determines which scaffold component is needed, and each model upgrade may resolve or introduce different spikes.

Question 23

What is the cost of running a Planner-Generator-Evaluator harness compared to a single agent?

Accepted Answer

The harness uses more tokens because it runs three separate context windows and includes negotiation, evaluation, and potential restart cycles. However, it produces higher-quality output with fewer wasted iterations than a single agent that self-certifies broken features. The cost is also tunable: you can adjust Evaluator cadence (per-sprint vs. end-of-generation), reduce contract negotiation rounds for simpler features, and remove unnecessary scaffolding as model capabilities improve. The cost-quality tradeoff improves with each model generation.

Frequently Asked Questions About Anthropic Planner-Generator-Evaluator Long-Agent Framework
