How Solo Devs Use PGE to Ship Multi-Feature Apps with AI
For solo developers and indie hackers building AI-powered products · Based on Anthropic's Planner-Generator-Evaluator Long-Agent Framework
// TL;DR
The Planner-Generator-Evaluator (PGE) framework lets solo developers and indie hackers turn a one-line prompt into a production-quality multi-feature application over a 4-6 hour autonomous session. Instead of babysitting a single coding agent that drifts and approves its own bugs, you set up three specialized agents: a Planner that decomposes your idea, a Generator that builds one feature at a time, and an Evaluator that adversarially tests the live app. Use it when you want to ship complete products, not prototypes, without writing every line yourself.
Why does my AI coding agent keep producing half-finished features?
Your agent is trapped in the self-evaluation loop. When the same model that writes code also reviews it, sycophancy bias kicks in — it sees a half-baked button and calls it done. This is not a prompting problem you can solve with 'be more critical.' The Anthropic team found that tuning a builder to be self-critical is fundamentally harder than tuning a separate critic to be harsh. It's easier to critique a meal than to cook one.
The fix is structural: separate the builder from the critic. The Planner-Generator-Evaluator framework gives each role its own context window and job. The Evaluator doesn't read the Generator's code comments or reasoning — it opens your app in a real browser using Playwright, clicks every button, submits every form, and grades against a contract you helped define.
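As a rough illustration of what "grading in a real browser" could look like, here is a minimal sketch using Playwright's Python sync API. The `Criterion` shape, the selectors, and the `verdict` format are assumptions made for this example and are not part of the framework. The real Evaluator is an agent driving the browser, not a fixed script.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """Hypothetical contract line: click something, then expect something visible."""
    name: str
    click_selector: str    # element the Evaluator clicks
    expect_selector: str   # element that must become visible afterwards


def check_in_browser(url: str, criteria: list[Criterion]) -> dict[str, bool]:
    """Open the live app and test each criterion on a fresh page."""
    # Imported lazily so verdict() below works without Playwright installed.
    from playwright.sync_api import sync_playwright

    results: dict[str, bool] = {}
    with sync_playwright() as p:
        browser = p.chromium.launch()
        for c in criteria:
            page = browser.new_page()
            try:
                page.goto(url)
                page.click(c.click_selector)
                page.wait_for_selector(c.expect_selector, timeout=5000)
                results[c.name] = True
            except Exception:
                # Any navigation error, missing element, or timeout is a failure.
                results[c.name] = False
            finally:
                page.close()
        browser.close()
    return results


def verdict(results: dict[str, bool]) -> dict:
    """Pass only if at least one criterion ran and none failed."""
    failed = sorted(name for name, ok in results.items() if not ok)
    return {"passed": bool(results) and not failed, "failed": failed}
```

The point of the design is that the verdict depends only on observed behaviour in the browser, never on the Generator's own description of what it built.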
How do I set this up as a solo developer?
You need four things to start:
1. Task prompt: A one-line description of what you want built. Keep it intentionally vague — 'build a retro game maker' or 'build a habit tracker with social features.' The Planner handles decomposition.
2. Quality rubric: Your written, opinionated criteria for what 'good' looks like. Pick 2-4 dimensions. If you're tired of AI slop aesthetics, weight Design and Originality heavily. Write down what you mean — 'no purple gradients, no generic card layouts, use a specific color palette inspired by [reference].'
3. Reference examples: Screenshots or code samples labeled 'this is good' and 'this is AI slop.' These calibrate the Evaluator toward your taste.
4. Model selection: Choose which models fill each role. Your best planner for the Planner, a fast coder for the Generator, a strong judge for the Evaluator.
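The four inputs above can be captured in one small config object. This is a sketch under stated assumptions: the field names, the 0-10 scoring scale, the "functionality" dimension, and the placeholder model names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

ROLES = ("planner", "generator", "evaluator")


@dataclass
class HarnessConfig:
    task_prompt: str                       # one-line, intentionally vague
    rubric: dict[str, float]               # dimension name -> weight
    good_examples: list[str] = field(default_factory=list)  # paths labeled "this is good"
    slop_examples: list[str] = field(default_factory=list)  # paths labeled "this is AI slop"
    models: dict[str, str] = field(default_factory=dict)    # role -> model name

    def validate(self) -> None:
        if not 2 <= len(self.rubric) <= 4:
            raise ValueError("pick 2-4 rubric dimensions")
        missing = [r for r in ROLES if r not in self.models]
        if missing:
            raise ValueError(f"no model chosen for: {', '.join(missing)}")


def weighted_score(rubric: dict[str, float], scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (assumed 0-10 scale)."""
    total_weight = sum(rubric.values())
    return sum(rubric[d] * scores[d] for d in rubric) / total_weight


config = HarnessConfig(
    task_prompt="build a retro game maker",
    rubric={"design": 3, "originality": 2, "functionality": 1},
    models={r: f"your-{r}-model" for r in ROLES},  # placeholders, not real model IDs
)
config.validate()
```

Weighting Design at 3 and Functionality at 1 means a polished but slightly buggy feature outscores a working but generic one, which is the trade-off the rubric is meant to make explicit.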
The Planner outputs a featurelist.json and a progress file. The Generator and Evaluator negotiate a contract of 20-30 specific criteria before coding starts. Then the Generator builds one feature at a time while the Evaluator stress-tests each one.
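To make the "one feature at a time" flow concrete, here is a sketch of what a feature list and its bookkeeping might look like. The schema (`id`, `title`, `status`) and the example features are assumptions for illustration. The source does not specify the actual layout of featurelist.json.

```python
import json

# Hypothetical featurelist.json content for the "retro game maker" prompt.
FEATURELIST_JSON = """
{
  "features": [
    {"id": "F1", "title": "Sprite editor",  "status": "done"},
    {"id": "F2", "title": "Level editor",   "status": "pending"},
    {"id": "F3", "title": "Playable preview", "status": "pending"}
  ]
}
"""


def next_feature(featurelist: dict):
    """Return the first feature still pending, or None when everything is done."""
    for feature in featurelist["features"]:
        if feature["status"] == "pending":
            return feature
    return None


def mark(featurelist: dict, feature_id: str, status: str) -> None:
    """Update one feature's status in place; unknown ids raise KeyError."""
    for feature in featurelist["features"]:
        if feature["id"] == feature_id:
            feature["status"] = status
            return
    raise KeyError(feature_id)


featurelist = json.loads(FEATURELIST_JSON)
current = next_feature(featurelist)  # the Generator would work on this one
```

Because the list is ordered and each entry carries its own status, a fresh agent session can resume simply by asking for the next pending feature.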
What if the agent gets stuck on a feature?
This is where the framework shines compared to manual babysitting. If the Generator can't pass a criterion after repeated attempts, the harness discards the current approach entirely and restarts that feature from scratch. No more watching an agent spend 45 minutes patching the same broken approach.
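The discard-and-restart behaviour can be sketched as two nested loops. Everything here is an assumption for illustration: the attempt and restart limits, the callback signatures, and the return format are hypothetical, and the source does not say how many attempts count as "repeated".

```python
from typing import Callable, List


def build_feature(
    feature: str,
    generate: Callable[[str, List[str]], None],  # builds or patches, given feedback
    evaluate: Callable[[str], List[str]],        # returns failed criteria; empty = pass
    discard: Callable[[str], None],              # e.g. revert the workspace for this feature
    max_attempts: int = 3,                       # assumed limit per approach
    max_restarts: int = 2,                       # assumed number of fresh starts
) -> dict:
    for restart in range(max_restarts + 1):
        feedback: List[str] = []  # a restart also drops feedback from the old approach
        for attempt in range(1, max_attempts + 1):
            generate(feature, feedback)
            feedback = evaluate(feature)
            if not feedback:
                return {"status": "passed", "restarts": restart, "attempts": attempt}
        # Patching did not converge: throw the approach away instead of patching more.
        discard(feature)
    return {"status": "abandoned", "restarts": max_restarts, "attempts": max_attempts}
```

The inner loop is ordinary fix-and-retest. The outer loop is what replaces the 45 minutes of patching: once the attempt budget is spent, the work is discarded and the next attempt starts without the old code or the old feedback.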
Use JSON files for all state (featurelist.json, progress tracking, learnings logs), because models are more likely to rewrite or overwrite free-form Markdown files than structured JSON. The file system is your source of truth across sessions, not the context window.
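If the file system is the source of truth, a half-written state file is worth guarding against. Below is a minimal sketch of reading and writing JSON state with an atomic replace. The file name `learnings.json` and the helper names are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_state(path, state) -> None:
    """Write JSON to a temp file in the same directory, then atomically swap it in."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, indent=2)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see the old file or the new one, never a partial
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_state(path, default):
    """Return parsed JSON, or the default if the file does not exist yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def append_learning(path, note: str) -> None:
    """Append one entry to a JSON list of learnings (hypothetical log format)."""
    learnings = load_state(path, default=[])
    learnings.append(note)
    save_state(path, learnings)
```

A new session would call `load_state` first and rebuild its picture of the project from disk, rather than relying on anything carried over in the context window.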
How do I improve results over multiple runs?
After each run, read the full agent transcripts. Find where the Evaluator approved something you wouldn't have. Update the rubric and system prompt to close that gap. This is the primary debugging loop — reading traces, not running more experiments.
As you upgrade models, reassess which harness components are still necessary. A newer model might not need fresh context windows between features. A stronger planner might not need sprint-level decomposition. Strip components that are no longer load-bearing — the harness should get simpler as models improve.
Start with a single feature sprint to validate your setup, then scale to full multi-feature sessions.
// FREQUENTLY ASKED QUESTIONS
Do I need a server or special infrastructure to run a PGE harness as a solo dev?
No dedicated server infrastructure is required. You run three separate agent sessions — one per role — coordinated through shared files on your local disk or a Git repository. The agents communicate exclusively through persistent artifacts (JSON files, code commits). You need API access to your chosen models and Playwright installed for web app verification. The entire harness can run on a standard development machine.
How much does a typical 4-6 hour PGE session cost in API fees?
Cost varies significantly by model choice and task complexity. Three separate context windows running iteratively over 4-6 hours with contract negotiation will use more tokens than a single-agent loop. However, the discard-and-restart mechanism prevents the expensive pattern of endlessly patching broken code. Budget for the Evaluator's live verification as well: Playwright screenshots and interactions add overhead. Start with a single-sprint test to estimate costs before committing to full runs.
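One way to turn a single-sprint test into a rough budget is to record token usage per role and scale it by the number of planned features. This sketch assumes cost grows roughly linearly with feature count, which may not hold in practice. All token counts and prices below are placeholders, not real pricing for any provider.

```python
def estimate_run_cost(sprint_usage: dict, price_per_mtok: dict, n_features: int) -> float:
    """Extrapolate full-run cost from one measured sprint.

    sprint_usage:   role -> {"input": tokens, "output": tokens} for one feature
    price_per_mtok: role -> {"input": price, "output": price} per million tokens
    """
    sprint_cost = 0.0
    for role, usage in sprint_usage.items():
        price = price_per_mtok[role]
        sprint_cost += (
            usage["input"] * price["input"] + usage["output"] * price["output"]
        ) / 1_000_000
    return sprint_cost * n_features


# Placeholder numbers only: substitute your own measured usage and current prices.
usage = {
    "generator": {"input": 2_000_000, "output": 500_000},
    "evaluator": {"input": 1_000_000, "output": 0},
}
prices = {
    "generator": {"input": 1.0, "output": 2.0},
    "evaluator": {"input": 1.0, "output": 2.0},
}
estimate = estimate_run_cost(usage, prices, n_features=10)
```

Treat the result as a floor, not a forecast: features that trigger discard-and-restart will cost more than the sprint you measured.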
Can I use this framework with Claude, GPT-4, or Gemini?
Yes. The framework is model-agnostic in structure but model-specific in tuning. Identify the quirks of your chosen model: how early it starts rushing to finish as its context fills up (context anxiety), how reliable its tool calls are, and how prone it is to overwriting files. Adjust session length, compaction strategy, and Evaluator harshness prompts accordingly. You can mix models across roles: a strong planner for the Planner, a fast coder for the Generator, and a different model with strong judgment for the Evaluator.