How AI Engineering Teams Build Agents That Run for Hours
For AI engineering team leads building autonomous coding agents · Based on Anthropic's Planner-Generator-Evaluator Long-Agent Framework
// TL;DR
For AI engineering teams building autonomous coding agents, the Planner-Generator-Evaluator (PGE) framework targets the core problem of quality degradation over long runs. Instead of a single-agent loop that drifts, stalls, or rubber-stamps poor output, you split planning, building, and adversarial evaluation across distinct agents, each with its own context window. Use it for agent pipelines that must produce production-grade code over 2-6 hour sessions: an adversarial Evaluator equipped with live verification tools catches bugs that a self-reviewing Generator would otherwise approve.
Why do autonomous coding agents lose quality over long sessions?
The fundamental problem is self-evaluation bias. When a single agent both builds and reviews its own code, the same sycophancy and generosity bias present in general LLM-as-judge systems causes it to approve half-implemented features. This gets worse over time due to context rot — coherence degrades as the agent works deeper into its context window — and context anxiety, where agents near their context limit rush to finish prematurely.
The Planner-Generator-Evaluator framework addresses all three failure modes structurally. Separating the builder (Generator) from the critic (Evaluator) into agents with their own context windows removes self-review and creates the adversarial pressure needed for genuine quality improvement. Because each role works in its own context window, no single agent has to carry the whole session, which limits exposure to context rot and context anxiety. The Evaluator never sees the Generator's reasoning trace, only the output artifact, which preserves the GAN-like tension that forces real improvement.
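One way to make the "artifact only" rule hard to violate is to encode it in the hand-off type itself. The sketch below is illustrative only: GeneratorResult, EvaluatorInput, to_evaluator_input, and the paths are hypothetical names, not part of any published harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratorResult:
    artifact_path: str    # what was built (for example a build directory)
    reasoning_trace: str  # the Generator's private working notes


@dataclass(frozen=True)
class EvaluatorInput:
    artifact_path: str  # the only Generator output the Evaluator sees
    contract_path: str  # the agreed acceptance criteria


def to_evaluator_input(result: GeneratorResult, contract_path: str) -> EvaluatorInput:
    # Deliberately drop reasoning_trace so the critic judges the artifact alone.
    return EvaluatorInput(artifact_path=result.artifact_path, contract_path=contract_path)


result = GeneratorResult(
    artifact_path="build/app",
    reasoning_trace="I believe the login flow works because...",
)
handoff = to_evaluator_input(result, contract_path="state/contract.json")
```

Because EvaluatorInput has no field for the trace, a later refactor cannot leak the Generator's reasoning into the Evaluator's prompt without changing the type.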
How do you architect a PGE harness for a production coding pipeline?
Start with role separation. Your Planner agent receives the high-level task and outputs a sprint-level featurelist.json — not a detailed technical spec. Over-specification at the planning stage causes errors to cascade across every subsequent sprint over a multi-hour horizon.
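As a rough illustration of "sprint-level, not a technical spec", a feature list might hold only user-visible outcomes. The field names below (task, sprints, features, status) are assumptions made for this sketch; the source does not define the schema of featurelist.json.

```python
import json

# Hypothetical shape for featurelist.json: user-visible outcomes only,
# with no file names, libraries, or implementation steps.
feature_list = {
    "task": "Build a to-do web app",
    "sprints": [
        {
            "sprint": 1,
            "features": [
                {"id": "F1", "description": "User can add a task", "status": "pending"},
                {"id": "F2", "description": "User can mark a task done", "status": "pending"},
            ],
        },
        {
            "sprint": 2,
            "features": [
                {"id": "F3", "description": "Tasks persist across reloads", "status": "pending"},
            ],
        },
    ],
}


def pending_features(plan: dict) -> list[str]:
    """Return ids of features not yet verified, in sprint order."""
    return [
        feature["id"]
        for sprint in plan["sprints"]
        for feature in sprint["features"]
        if feature["status"] != "done"
    ]


serialized = json.dumps(feature_list, indent=2)  # what would be written to disk
```

Keeping each entry to one sentence of observable behavior leaves the technical decisions to the Generator, which is where a wrong early guess is cheapest to undo.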
Next, instantiate the Generator and Evaluator with separate context windows and system prompts. Before any code is written, they negotiate a contract of 20-30 granular criteria via shared files on disk. The Evaluator grades against this contract using live verification tools — Playwright MCP for web apps, computer use for native apps.
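The contract-and-grade step can be sketched as a list of small, independently checkable criteria. This is a minimal sketch under stated assumptions: Criterion, grade, and failing are hypothetical names, and each check is a plain Python callable standing in for a live verification step (in a real harness that step would drive a browser or desktop session rather than read a dictionary).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    id: str
    description: str
    check: Callable[[], bool]  # stand-in for a live verification step


def grade(contract: list[Criterion]) -> dict[str, bool]:
    """Run every check; a check that raises counts as a failure."""
    results = {}
    for criterion in contract:
        try:
            results[criterion.id] = bool(criterion.check())
        except Exception:
            results[criterion.id] = False
    return results


def failing(results: dict[str, bool]) -> list[str]:
    """Ids of criteria that did not pass, in contract order."""
    return [cid for cid, ok in results.items() if not ok]


# Toy "app state" standing in for what a verification tool would observe.
observed = {"add_button_visible": True, "task_count_after_add": 0}

contract = [
    Criterion("C1", "Add button is visible on load",
              lambda: observed["add_button_visible"]),
    Criterion("C2", "Adding a task increases the count to 1",
              lambda: observed["task_count_after_add"] == 1),
    Criterion("C3", "Empty-state message is shown",
              lambda: observed["empty_state_visible"]),  # key missing: check raises
]

results = grade(contract)
```

Treating an exception as a failure is a design choice in this sketch: a criterion the Evaluator cannot verify should not count as passed.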
For state management, use the file system as your source of truth. Persistent artifacts such as featurelist.json, progress files, and timestamped learnings logs are more reliable than context-window memory across long runs. Prefer JSON over markdown for these files, because models are more prone to rewriting or overwriting markdown files than structured JSON.
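A minimal sketch of file-backed state, using only the Python standard library. The helper names and the JSON Lines format for the learnings log are assumptions for illustration, and the write-to-temp-then-rename step is an addition of this sketch (it guards against a half-written state file if the process dies mid-write), not something the framework prescribes.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_json_atomic(path: Path, data) -> None:
    """Write JSON to a temp file in the same directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
        os.replace(tmp_name, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def append_learning(log_path: Path, note: str) -> dict:
    """Append one timestamped entry; existing entries are never rewritten."""
    entry = {"ts": datetime.now(timezone.utc).isoformat(), "note": note}
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry


def read_learnings(log_path: Path) -> list[dict]:
    """Load every entry from the log, oldest first."""
    lines = log_path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]
```

An append-only log complements the JSON state file: a fresh agent session can read both to recover where the run stands without relying on anything held in a previous context window.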
The critical harness mechanism: if the Generator cannot hill-climb against a criterion after repeated attempts, discard the current approach entirely and restart. This ability to course-correct is the core advantage over Ralph loops or single-session self-review.
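The discard-and-restart rule can be sketched as a loop with a stall counter. Everything here is an assumption for illustration: build_with_restarts, the patience threshold, and the numeric score are hypothetical, and a real harness would derive "progress" from the Evaluator's contract results rather than a single float.

```python
from typing import Any, Callable, Optional


def build_with_restarts(
    attempt: Callable[[int, Optional[str]], Any],  # (approach index, feedback) -> artifact
    score: Callable[[Any], float],                 # higher is better
    target: float,
    patience: int = 2,        # non-improving attempts tolerated per approach
    max_attempts: int = 10,   # hard cap per approach
    max_approaches: int = 3,
) -> Optional[tuple[int, Any]]:
    """Return (approach index, artifact) on success, or None if every approach stalls."""
    for approach in range(max_approaches):
        best, stalls, feedback = float("-inf"), 0, None
        for _ in range(max_attempts):
            artifact = attempt(approach, feedback)
            current = score(artifact)
            if current >= target:
                return approach, artifact
            if current > best:
                best, stalls = current, 0
            else:
                stalls += 1
                if stalls >= patience:
                    break  # no hill-climbing: discard this approach entirely
            feedback = f"best={best}, last={current}"
    return None


# Toy run: approach 0 is stuck at 0.4, approach 1 improves until it passes.
scripted = {0: iter([0.4, 0.4, 0.4, 0.4]), 1: iter([0.5, 0.8, 1.0])}


def demo_attempt(approach: int, feedback: Optional[str]) -> tuple[int, float]:
    return approach, next(scripted[approach])


outcome = build_with_restarts(demo_attempt, score=lambda artifact: artifact[1], target=0.9)
```

In the toy run, the first approach is abandoned after two attempts fail to beat its best score, and the second approach reaches the target on its third attempt.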
How do you debug and tune the harness after deployment?
Read agent transcripts by hand, line by line. This is the primary debugging loop — not running more experiments. Find every point where the Evaluator's judgment diverged from yours. Empathize with why the model made each decision, then update the Evaluator's system prompt and rubric to close that gap.
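A small helper can point you at the transcript passages worth reading first. This sketch assumes you record your own pass/fail verdict per criterion while reading a trace; find_divergences and agreement_rate are hypothetical names, and the verdict dictionaries are made-up data.

```python
def find_divergences(evaluator: dict[str, bool], human: dict[str, bool]) -> list[str]:
    """Criterion ids judged by both sides where the verdicts differ."""
    shared = evaluator.keys() & human.keys()
    return sorted(cid for cid in shared if evaluator[cid] != human[cid])


def agreement_rate(evaluator: dict[str, bool], human: dict[str, bool]) -> float:
    """Fraction of jointly judged criteria on which both sides agree."""
    shared = evaluator.keys() & human.keys()
    if not shared:
        raise ValueError("no criteria were judged by both the Evaluator and the human")
    return sum(evaluator[cid] == human[cid] for cid in shared) / len(shared)


evaluator_verdicts = {"C1": True, "C2": True, "C3": False, "C4": True}
human_verdicts = {"C1": True, "C2": False, "C3": False, "C4": True}

to_review = find_divergences(evaluator_verdicts, human_verdicts)
rate = agreement_rate(evaluator_verdicts, human_verdicts)
```

The list of divergent ids is only an index into the transcript. The tuning work is still reading what the Evaluator saw at that point and adjusting its prompt or rubric accordingly.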
Regularly reassess which scaffold components are still load-bearing. Context-window resets between sessions may be critical for one model generation and unnecessary for the next. Sprint decomposition may be essential for weaker planners but removable for stronger agentic models. The harness and model co-evolve — actively hunt for components to delete as capabilities improve.
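One way to hunt for deletable components is to put each one behind a flag and test the harness with one flag off at a time. The sketch below is an assumption-laden illustration: HarnessConfig, its two flags, and single_ablations are hypothetical names chosen to mirror the two components mentioned above.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class HarnessConfig:
    reset_context_between_sessions: bool = True
    sprint_decomposition: bool = True


def single_ablations(base: HarnessConfig) -> dict[str, HarnessConfig]:
    """One config per currently enabled component, with only that component off."""
    return {
        f.name: replace(base, **{f.name: False})
        for f in fields(base)
        if getattr(base, f.name)
    }


baseline = HarnessConfig()
ablations = single_ablations(baseline)
```

Running the same task under the baseline and under each ablation gives a direct comparison: if quality holds with a component disabled on the current model, that component is a candidate for removal.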
What does a mature PGE pipeline look like in practice?
A mature pipeline runs 4-6 hour sessions producing multi-feature applications with each feature verified against a negotiated contract. The Evaluator has been calibrated over multiple runs with few-shot examples of good and bad output. The harness has been stripped of scaffold components rendered redundant by model upgrades. The team reads traces after each major run and continuously tunes the rubric to maintain alignment between the Evaluator's taste and their own.
Start by implementing the three-role separation with a single feature sprint. Get the Generator-Evaluator contract negotiation working before scaling to multi-sprint sessions. Read every trace from your first five runs — the calibration work upfront saves hours of debugging later.
// FREQUENTLY ASKED QUESTIONS
How many agents do I need to run the Planner-Generator-Evaluator framework?
You need three agents minimum — Planner, Generator, and Evaluator — each with its own context window and system prompt. The Planner runs once at the start to decompose the task. The Generator and Evaluator run iteratively per feature. You can optionally add a secondary agent for transcript analysis during debugging. The key structural requirement is that the Generator and Evaluator never share a context window.
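The run order described above (Planner once, then Generator and Evaluator iterating per feature) can be sketched as a small orchestration loop. The three agents are represented by plain callables here, and run_pipeline plus the stub functions are hypothetical; in a real harness each callable would wrap a separate model session with its own system prompt.

```python
from typing import Callable, Optional

Planner = Callable[[str], list[str]]
Generator = Callable[[str, Optional[str]], str]   # (feature, feedback) -> artifact
Evaluator = Callable[[str, str], tuple[bool, str]]  # (feature, artifact) -> (passed, feedback)


def run_pipeline(
    task: str,
    planner: Planner,
    generator: Generator,
    evaluator: Evaluator,
    max_rounds: int = 3,
) -> dict[str, Optional[int]]:
    """Map each feature to the round in which it passed, or None if it never did."""
    features = planner(task)  # the Planner runs exactly once
    report: dict[str, Optional[int]] = {}
    for feature in features:
        report[feature] = None
        feedback: Optional[str] = None
        for round_number in range(1, max_rounds + 1):
            artifact = generator(feature, feedback)
            passed, feedback = evaluator(feature, artifact)  # sees only the artifact
            if passed:
                report[feature] = round_number
                break
    return report


# Stubs standing in for three separately prompted agents.
def stub_planner(task: str) -> list[str]:
    return ["add task", "mark task done"]


def stub_generator(feature: str, feedback: Optional[str]) -> str:
    return f"{feature} v2" if feedback else f"{feature} v1"


def stub_evaluator(feature: str, artifact: str) -> tuple[bool, str]:
    passed = feature == "add task" or artifact.endswith("v2")
    return passed, ("" if passed else "button does nothing when clicked")


report = run_pipeline("Build a to-do web app", stub_planner, stub_generator, stub_evaluator)
```

The only channel from Evaluator back to Generator in this sketch is the feedback string, which keeps the two roles from sharing context by construction.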
What's the cost difference between PGE and a single-agent loop for long tasks?
PGE uses more tokens per run because of its three separate context windows and the contract negotiation phase. In return, the output tends to need less human rework, and the discard-and-restart mechanism prevents the costly pattern of endlessly patching a broken approach. For tasks over 2 hours, the total cost including human review time is typically lower with PGE, because the output is closer to production-ready.
Can I use different models for the Generator and Evaluator roles?
Yes, and this is often the right approach. Use your most capable planning model (e.g. Opus-class) for the Planner, a fast, capable coding model for the Generator, and a model with strong judgment for the Evaluator. Model selection drives harness design, because capabilities and failure modes differ from model to model. The key constraint is that the Evaluator must be capable enough to operate live verification tools like Playwright.