How PMs Evaluate the Planner-Generator-Evaluator Agent Pattern

For AI product managers and technical leads evaluating agent architectures · Based on Anthropic's Planner-Generator-Evaluator Long-Agent Framework

// TL;DR

The Planner-Generator-Evaluator (PGE) framework is the current best-practice pattern for agents that need to run coherently for more than 30 minutes. It targets three specific failure modes: self-evaluation bias (agents approving their own bugs), context rot (coherence degrading over long sessions), and inability to course-correct (patching a broken approach instead of restarting). Evaluate it when your team is building autonomous development pipelines, diagnosing quality issues in existing agent systems, or planning architecture for multi-hour agentic workflows.

What problem does the Planner-Generator-Evaluator pattern solve?

It solves the quality collapse that occurs when AI agents run for extended periods. Three specific failure modes cause this: self-evaluation bias (the agent approves its own mediocre output due to sycophancy), context rot (coherence degrades as the context window fills), and context anxiety (the model rushes to finish as it approaches context limits).

Single-agent loops like RALF, which feeds a prompt into a coding agent CLI on repeat, run a fixed plan with no adversarial pressure. They are 'deterministically bad in an undeterministic world.' The PGE pattern addresses this by splitting the work across three specialized roles, each with its own context window, which adds the adversarial evaluation pressure needed for genuine quality improvement over multi-hour horizons.
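The role separation can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not Anthropic's implementation: Role, run_feature, make_generator, and max_attempts are hypothetical names, and each "model" is a plain callable standing in for a real LLM client.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for an LLM client: message history in, reply out.
Model = Callable[[List[str]], str]


@dataclass
class Role:
    """One agent role with its own private context (message history)."""
    name: str
    model: Model
    context: List[str] = field(default_factory=list)

    def ask(self, message: str) -> str:
        self.context.append(message)
        reply = self.model(self.context)
        self.context.append(reply)
        return reply


def run_feature(task: str, planner: Role, make_generator: Callable[[], Role],
                evaluator: Role, max_attempts: int = 3) -> str:
    """Plan once, then generate and evaluate; discard and restart on failure."""
    plan = planner.ask(task)
    for _ in range(max_attempts):
        # A fresh Generator per attempt: a failed approach is discarded,
        # not patched inside a context that already contains the failure.
        generator = make_generator()
        output = generator.ask(plan)
        # The Evaluator sees only the output, never the Generator's context.
        verdict = evaluator.ask(output)
        if verdict.startswith("PASS"):
            return output
    raise RuntimeError(f"no passing output after {max_attempts} attempts")
```

In this sketch the only thing the roles share is the text passed between them, which is what keeps the critic from inheriting the builder's reasoning. A real harness would pass files rather than strings and would give the Evaluator tools for checking the running application.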

How does it compare to other agent architecture patterns?

Compared to single-agent loops (RALF, basic ReAct), PGE adds role separation, adversarial evaluation, and the ability to discard failing approaches. Each run costs more in API tokens, but the output needs less human rework on complex tasks.

Compared to AutoGPT/BabyAGI-style self-directed agents, PGE uses file-based contracts and persistent artifacts instead of relying on in-context planning and memory. The structural separation of builder and critic prevents the self-evaluation trap that causes those architectures to approve poor work.

Compared to human-in-the-loop approaches, PGE can run autonomously for hours. The human role shifts from real-time oversight to post-run trace analysis and rubric tuning — a more efficient use of expert time.

The key differentiator is the Generator-Evaluator contract negotiation. Before any code is written, both agents agree on what 'done' means through 20-30 granular criteria. This converts fuzzy user stories into testable assertions without requiring the Planner to over-specify.
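One way to picture a file-based contract is a JSON file that the Generator and Evaluator both read, with one entry per agreed criterion. The sketch below is an assumption about how such a file could look, not a format defined by the framework: the field names and the functions write_contract, record_result, and is_done are hypothetical.

```python
import json
from pathlib import Path
from typing import List


def write_contract(path: Path, feature: str, criteria: List[str]) -> None:
    """Persist the agreed definition of 'done' so both agents read the same file."""
    contract = {
        "feature": feature,
        "criteria": [
            {"id": i + 1, "text": text, "passed": None}  # None = not yet checked
            for i, text in enumerate(criteria)
        ],
    }
    path.write_text(json.dumps(contract, indent=2))


def record_result(path: Path, criterion_id: int, passed: bool) -> None:
    """Evaluator side: mark one criterion as passed or failed."""
    contract = json.loads(path.read_text())
    for criterion in contract["criteria"]:
        if criterion["id"] == criterion_id:
            criterion["passed"] = passed
    path.write_text(json.dumps(contract, indent=2))


def is_done(path: Path) -> bool:
    """Done only if the contract is non-empty and every criterion explicitly passed."""
    criteria = json.loads(path.read_text())["criteria"]
    return bool(criteria) and all(c["passed"] is True for c in criteria)
```

Because the state lives in a file rather than in any agent's context, it survives a restart of the Generator, and an unchecked criterion counts as not done.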

What are the key decision points for adopting PGE?

Task duration: If your agents consistently run under 30 minutes, a simpler architecture may suffice. PGE's overhead — contract negotiation, separate context windows, live verification — pays off primarily for multi-hour sessions.

Quality requirements: If you need production-grade output across subjective dimensions (design, originality, craft) and not just functional correctness, PGE's adversarial evaluation and rubric calibration provide the necessary quality control.

Model evolution strategy: The harness and model co-evolve. Components that are load-bearing for one model generation may be redundant for the next. Your team needs a process for regularly reassessing and simplifying the harness as models improve. This is ongoing work, not a one-time architecture decision.

Debugging investment: The primary debugging loop is reading agent transcripts by hand. Teams must budget time for trace analysis and rubric tuning, especially during initial deployment. Plan for significant prompt tuning effort to make the Evaluator genuinely harsh — out of the box, LLMs make bad QA agents.

What metrics should I track to evaluate PGE effectiveness?

Track four things: (1) Feature completion rate — percentage of planned features that pass all contract criteria. (2) Restart rate — how often the harness discards and restarts a feature, and whether that rate decreases as you tune the rubric. (3) Evaluator-human alignment — how often the Evaluator's grading matches your team's judgment on the same output. (4) Harness simplification rate — how many scaffold components you've removed over time as models improve.
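The four metrics can be computed from a simple per-feature run log. The sketch below assumes a hypothetical record shape (FeatureRun) and picks one reasonable definition for each metric; in particular, restart rate is taken as average restarts per feature, and simplification rate as the fraction of scaffold components removed. Your own logging schema and definitions may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FeatureRun:
    """One planned feature as recorded in a run log (hypothetical schema)."""
    passed_all_criteria: bool
    restarts: int
    evaluator_pass: bool  # Evaluator's verdict on the final output
    human_pass: bool      # reviewer's verdict on the same output


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when there is no data."""
    return numerator / denominator if denominator else 0.0


def completion_rate(runs: List[FeatureRun]) -> float:
    """Share of planned features that passed every contract criterion."""
    return _ratio(sum(r.passed_all_criteria for r in runs), len(runs))


def restarts_per_feature(runs: List[FeatureRun]) -> float:
    """Average number of discard-and-restart events per feature."""
    return _ratio(sum(r.restarts for r in runs), len(runs))


def evaluator_human_alignment(runs: List[FeatureRun]) -> float:
    """Share of features where Evaluator and human reached the same verdict."""
    return _ratio(sum(r.evaluator_pass == r.human_pass for r in runs), len(runs))


def simplification_rate(initial_components: int, current_components: int) -> float:
    """Fraction of the original scaffold components that have been removed."""
    return _ratio(initial_components - current_components, initial_components)
```

Computing these per sprint rather than once makes the trends visible: restarts per feature should fall as the rubric is tuned, and alignment should rise as the Evaluator is calibrated.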

Start with a proof-of-concept sprint on a well-understood feature to calibrate your rubric and validate the architecture before committing to full production deployment.

// FREQUENTLY ASKED QUESTIONS

How long does it take to implement a PGE harness from scratch?

Expect 2-3 days for the initial harness implementation: setting up three agent roles with separate context windows, file-based communication, and Playwright integration for live verification. Then allocate 1-2 weeks for Evaluator calibration through iterative trace reading and rubric tuning. The architecture is straightforward; calibrating the Evaluator's taste and harshness is where the real effort goes. Start with a single-feature sprint to validate before scaling.

What's the ROI argument for PGE versus simpler agent architectures?

PGE costs more per run in API tokens but produces output that requires significantly less human rework. The discard-and-restart mechanism prevents hours spent patching broken approaches. For teams building production applications, the total cost — API fees plus human review time — is typically lower with PGE for tasks over 2 hours. The key metric is human rework hours saved per feature, not API cost per token.
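The total-cost argument reduces to simple arithmetic: API fees plus human rework hours priced at an hourly rate. The sketch below makes that explicit. The function names are hypothetical and the figures in the usage lines are illustrative placeholders, not measurements; substitute numbers from your own runs.

```python
def total_cost(api_cost: float, rework_hours: float, hourly_rate: float) -> float:
    """Total cost of one feature: API fees plus human rework time."""
    return api_cost + rework_hours * hourly_rate


def pge_is_cheaper(simple_api: float, simple_rework_hours: float,
                   pge_api: float, pge_rework_hours: float,
                   hourly_rate: float) -> bool:
    """True when PGE's extra API spend is outweighed by the rework it saves."""
    simple = total_cost(simple_api, simple_rework_hours, hourly_rate)
    pge = total_cost(pge_api, pge_rework_hours, hourly_rate)
    return pge < simple


# Illustrative placeholder inputs only (not measured data).
simple = total_cost(api_cost=20.0, rework_hours=6.0, hourly_rate=100.0)
pge = total_cost(api_cost=120.0, rework_hours=1.5, hourly_rate=100.0)
print(simple, pge)  # 620.0 270.0
```

With these placeholder inputs, PGE spends 100 more on API fees but saves 4.5 rework hours, so the comparison turns on rework hours rather than on token price. If the rework saved is small, the same formula shows the simpler architecture winning.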

Does the PGE framework require specific models or providers?

No — the framework is structurally model-agnostic. The three-role separation with file-based contracts works with any capable language model. However, harness tuning is model-specific: you must identify each model's spiky behaviours (context rot thresholds, tool-calling reliability, sycophancy levels) and adjust scaffold components accordingly. The framework explicitly expects the harness to evolve as you change models — this is a feature, not a limitation.