How Do AI Teams Build Agents That Run for Hours?

For AI engineering team leads building production agent systems · Based on Anthropic Planner-Generator-Evaluator Long-Agent Framework

// TL;DR

The Planner-Generator-Evaluator framework targets the core problem of multi-hour autonomous sessions: output quality degrades as the session runs. Instead of a single agent loop that self-certifies broken features, you deploy three roles with separate context windows: a Planner for high-level decomposition, a Generator for building, and an adversarial Evaluator with live verification tools. The framework co-evolves with model improvements: as models internalize a behavior, you strip the scaffolding that compensated for it, which keeps the harness lean while holding agent output to a production-grade bar.

Why do single-agent loops fail for production workloads?

Single-agent loops fail because models cannot reliably judge their own output. The same sycophancy bias that appears in LLM-as-judge benchmarks shows up in coding agents: a builder reviewing its own work calls a half-implemented button done. For production workloads running 2–6 hours, this compounds. Context rot (coherence degrading as the context window fills) and context anxiety (the model wrapping up early as it senses its context limit approaching) both push toward incomplete work, and without adversarial pressure there is no mechanism to catch and fix these failures.

The Planner-Generator-Evaluator framework addresses this architecturally. Each role operates in its own context window with its own system prompt. The Generator never sees the Evaluator's context, and the Evaluator never sees the Generator's reasoning trace, only the output artifact. This separation creates genuine adversarial pressure, loosely analogous to the generator and discriminator in a generative adversarial network (GAN).
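
The context separation described above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's actual implementation: `RoleContext`, `hand_off_artifact`, and the prompt strings are hypothetical names chosen for the example, and no model API is called.

```python
from dataclasses import dataclass, field


@dataclass
class RoleContext:
    """One role's private context: its own system prompt and message history."""
    name: str
    system_prompt: str
    messages: list[dict] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})


def hand_off_artifact(evaluator: RoleContext, artifact: str) -> None:
    """Pass only the output artifact across the boundary.

    The Generator's message history is never an argument here, so its
    reasoning trace cannot leak into the Evaluator's context.
    """
    evaluator.add("user", f"Evaluate this artifact against the contract:\n{artifact}")


# Hypothetical prompts and artifact, for illustration only.
generator = RoleContext("generator", "You build one feature at a time.")
evaluator = RoleContext("evaluator", "You are a skeptical QA reviewer.")

generator.add("assistant", "Reasoning: I will stub the button handler for now.")
hand_off_artifact(evaluator, "<button id='save'>Save</button>")
```

Making the hand-off function accept only the artifact is one way to enforce the boundary in code rather than by convention.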

How should an engineering team implement the three-role harness?

Start with role assignment and model selection. Use your most capable planning model (Opus-class) for the Planner — it runs once and sets the project direction. The Generator needs strong coding and tool-use capabilities. The Evaluator needs strong judgment and access to live verification tools like Playwright MCP.

The workflow follows a clear sequence:

1. Planner receives the user prompt and outputs featurelist.json, a progress file, and an init script.

2. Generator and Evaluator negotiate a contract of 20–30 testable criteria before any code is written.

3. Generator builds one feature at a time, writing state to JSON logs.

4. Evaluator actively tests using Playwright MCP, grades against the contract, and writes critique to shared files.

5. If hill climbing stalls, meaning successive revisions stop improving the Evaluator's scores, the harness discards the current approach and restarts; it never patches indefinitely.
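
Steps 3 to 5 can be sketched as a control loop. The sketch below assumes the Generator and Evaluator are wrapped as plain callables (`build` and `evaluate`, both hypothetical), that the Evaluator returns a single numeric score, and that "stalled" means `patience` consecutive revisions without a new best score. The thresholds are placeholders, and contract negotiation is left out.

```python
from typing import Callable


def run_feature(
    build: Callable[[str, list[str]], str],
    evaluate: Callable[[str], tuple[float, str]],
    feature: str,
    pass_score: float = 0.9,
    patience: int = 2,
    max_restarts: int = 1,
) -> dict:
    """Hill-climb on one feature; discard and restart when the score stalls."""
    restarts = 0
    attempts = 0
    while True:
        critiques: list[str] = []  # a restart discards all earlier critiques
        best = float("-inf")
        stalled = 0
        while stalled < patience:
            artifact = build(feature, critiques)
            score, critique = evaluate(artifact)
            attempts += 1
            if score >= pass_score:
                return {"feature": feature, "passed": True,
                        "restarts": restarts, "attempts": attempts, "score": score}
            if score > best:
                best, stalled = score, 0
            else:
                stalled += 1  # no improvement over the best score so far
            critiques.append(critique)
        if restarts >= max_restarts:
            return {"feature": feature, "passed": False,
                    "restarts": restarts, "attempts": attempts, "score": best}
        restarts += 1


# Stubs standing in for real model calls, with scripted scores.
scripted_scores = iter([0.5, 0.5, 0.5, 0.6, 0.95])


def stub_build(feature: str, critiques: list[str]) -> str:
    return f"{feature} (revision {len(critiques) + 1})"


def stub_evaluate(artifact: str) -> tuple[float, str]:
    score = next(scripted_scores)
    return score, f"scored {score} on {artifact}"


result = run_feature(stub_build, stub_evaluate, "save button")
```

With the scripted scores, the first approach stalls after three attempts, is discarded, and the restarted approach passes on its second attempt.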

Critical implementation detail: use JSON files for all persistent state, not markdown. Models tend to rewrite or overwrite markdown files wholesale, while they are less likely to make inappropriate changes to a structured JSON file.
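
A minimal sketch of JSON-backed state, using only the Python standard library. The schema here (a `features` list with `id` and `passes` fields) is an assumption made for the example; the source names featurelist.json but does not specify its layout. Writing to a temporary file and swapping it in with `os.replace` means a crash mid-write should not leave a half-written state file behind.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file in the same directory, then swap it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_name, path)


def mark_feature(path: Path, feature_id: str, passes: bool) -> None:
    """Flip one feature's status without touching the rest of the file."""
    state = load_state(path)
    for feature in state["features"]:
        if feature["id"] == feature_id:
            feature["passes"] = passes
            break
    else:
        raise KeyError(f"unknown feature: {feature_id}")
    save_state(path, state)


# Usage with a throwaway directory; the field names are hypothetical.
with tempfile.TemporaryDirectory() as workdir:
    state_path = Path(workdir) / "featurelist.json"
    save_state(state_path, {"features": [
        {"id": "login", "passes": False},
        {"id": "search", "passes": False},
    ]})
    mark_feature(state_path, "login", True)
    reloaded = load_state(state_path)
```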

How do you maintain and evolve the harness across model generations?

The harness is not permanent; it co-evolves with the frontier. After every major model release, identify the new model's spiky behaviors, the areas where it is unexpectedly strong or weak. Run a simplified version of your harness and compare its output quality against the full harness. If context anxiety is resolved, remove forced session resets. If coherence holds over 2-hour sessions, reduce sprint granularity. If the model can self-discard failing approaches, simplify the restart logic.
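
One way to make that comparison concrete is an ablation table: score the full harness, then score it again with one piece of scaffolding removed at a time. The sketch below assumes you already have a single quality score per run. The scaffolding names, the scores, and the tolerance are placeholders, not measurements.

```python
def removable_scaffolding(
    full_score: float,
    ablated_scores: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Return scaffolding whose removal keeps quality within `tolerance` of the full harness."""
    return sorted(
        name
        for name, score in ablated_scores.items()
        if score >= full_score - tolerance
    )


# Placeholder scores: each entry is quality with that one piece removed.
candidates = removable_scaffolding(
    full_score=0.86,
    ablated_scores={
        "forced_session_resets": 0.87,
        "fine_sprint_granularity": 0.78,
        "restart_logic": 0.85,
    },
)
```

With these placeholder numbers, `candidates` contains `forced_session_resets` and `restart_logic`. A single score hides a lot, so treat the result as a shortlist to confirm by reading transcripts, not as a verdict.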

The primary debugging loop is reading agent transcripts by hand. Find every point where the Evaluator's judgment diverged from yours. Update the system prompt and rubric to close that gap. This is not optional — running more experiments without reading traces is a false shortcut.
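
If you record your own verdict next to the Evaluator's while reading a transcript, finding the divergence points is mechanical. A minimal sketch, assuming pass/fail verdicts keyed by contract criterion; the criterion names are hypothetical.

```python
def find_divergences(
    evaluator_verdicts: dict[str, bool],
    human_verdicts: dict[str, bool],
) -> list[str]:
    """List the criteria where the Evaluator and the human reader disagree."""
    shared = evaluator_verdicts.keys() & human_verdicts.keys()
    return sorted(c for c in shared if evaluator_verdicts[c] != human_verdicts[c])


# Hypothetical verdicts from one sprint.
evaluator_verdicts = {"save-persists": True, "undo-works": True, "empty-state": False}
human_verdicts = {"save-persists": True, "undo-works": False, "empty-state": False}

to_review = find_divergences(evaluator_verdicts, human_verdicts)
```

Here `to_review` is `["undo-works"]`: the Evaluator passed a criterion that the human reader failed, which points at the rubric wording to tighten.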

What operational metrics should the team track?

Track these across runs: contract negotiation rounds per sprint, Evaluator pass rate on first attempt, number of full discards-and-restarts, total tokens consumed per completed feature, and rubric dimension scores over time. These metrics tell you where the harness is working (high first-attempt pass rates) and where it needs tuning (frequent restarts on specific rubric dimensions).
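
A sketch of how most of these metrics could be aggregated from per-sprint records. `SprintRecord` and its fields are assumptions about what your harness logs, and the numbers are placeholders. Rubric dimension scores over time are omitted because they need a time series rather than a single summary.

```python
from dataclasses import dataclass


@dataclass
class SprintRecord:
    feature: str
    negotiation_rounds: int
    passed_first_attempt: bool
    completed: bool
    restarts: int
    tokens: int


def summarize(records: list[SprintRecord]) -> dict:
    """Aggregate harness metrics across sprints."""
    if not records:
        raise ValueError("no sprint records to summarize")
    completed = [r for r in records if r.completed]
    total_tokens = sum(r.tokens for r in records)
    return {
        "avg_negotiation_rounds": sum(r.negotiation_rounds for r in records) / len(records),
        "first_attempt_pass_rate": sum(r.passed_first_attempt for r in records) / len(records),
        "total_restarts": sum(r.restarts for r in records),
        # Tokens spent on abandoned features still count toward the cost of completed ones.
        "tokens_per_completed_feature": total_tokens / len(completed) if completed else None,
    }


# Placeholder records for three sprints.
summary = summarize([
    SprintRecord("login", 2, True, True, 0, 100_000),
    SprintRecord("search", 3, False, True, 1, 250_000),
    SprintRecord("export", 1, False, False, 2, 150_000),
])
```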

The next step is to build your first harness for a contained project — a single multi-feature web app — and read every line of the resulting transcripts before scaling to more complex production workloads.

// FREQUENTLY ASKED QUESTIONS

How many engineers does it take to run a Planner-Generator-Evaluator harness?

One engineer can operate the harness for a single project. The human role is primarily harness design, rubric authoring, and trace reading — not hands-on coding. For teams, one engineer typically owns the harness architecture and rubric calibration while others contribute domain-specific quality criteria. The agents do the building and QA autonomously; the engineer's job is to ensure the Evaluator's taste matches the team's standards.

Can I run multiple Generator-Evaluator pairs in parallel?

Yes, you can parallelize at the feature level: each Generator-Evaluator pair works independently on a different feature from featurelist.json. The Planner's sprint decomposition naturally supports this. Use the shared progress file and Git repo to coordinate. Be careful with features that have dependencies; the Planner should identify these so dependent features run sequentially while independent ones run in parallel.
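
Dependency-aware scheduling can be sketched with `graphlib.TopologicalSorter` from the Python standard library (3.9+). The sketch assumes the Planner emits a mapping from each feature to the features it depends on; that format and the feature names are hypothetical. Each wave holds features that can run in parallel, and a wave starts only after the previous one finishes.

```python
from graphlib import TopologicalSorter


def parallel_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group features into waves; every feature's dependencies sit in earlier waves."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical feature graph: each key depends on the features in its set.
waves = parallel_waves({
    "auth": set(),
    "search": set(),
    "profile": {"auth"},
    "saved_searches": {"auth", "search"},
})
```

For this graph, `auth` and `search` form the first wave, and `profile` and `saved_searches` form the second.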

How do I justify the higher token cost of a three-agent system to leadership?

Frame it as cost per quality-passing feature, not cost per token. A single-agent loop may use fewer tokens but produces output that fails QA and requires human rework. The three-agent harness produces output that has already passed an adversarial Evaluator with live verification, which reduces downstream human review time. Track the discard-and-restart rate; as you tune the harness and models improve, this rate should drop and per-feature cost should move toward single-agent levels while quality stays higher.
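
The framing reduces to simple arithmetic: add token spend and human rework cost, then divide by the number of features that pass QA. Every number below is a made-up placeholder to show the calculation, not a benchmark result; substitute your own token prices, rework hours, and pass counts.

```python
def cost_per_passing_feature(
    tokens: int,
    usd_per_million_tokens: float,
    rework_hours: float,
    usd_per_hour: float,
    passing_features: int,
) -> float:
    """Total cost (tokens plus human rework) divided by features that pass QA."""
    if passing_features <= 0:
        raise ValueError("need at least one passing feature")
    token_cost = tokens / 1_000_000 * usd_per_million_tokens
    return (token_cost + rework_hours * usd_per_hour) / passing_features


# Placeholder inputs only: fewer tokens but more rework vs. more tokens but less rework.
single_agent = cost_per_passing_feature(
    tokens=2_000_000, usd_per_million_tokens=10.0,
    rework_hours=12, usd_per_hour=100.0, passing_features=4,
)
three_agent = cost_per_passing_feature(
    tokens=8_000_000, usd_per_million_tokens=10.0,
    rework_hours=2, usd_per_hour=100.0, passing_features=9,
)
```

With these placeholder inputs the three-agent run costs four times as much in tokens but less per passing feature, because human rework dominates the total.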
