How Solo Devs Use AI Agents to Build Full Apps

For solo developers and indie hackers building AI-powered products · Based on the Anthropic Planner-Generator-Evaluator Long-Agent Framework

// TL;DR

For solo developers and indie hackers, the Planner-Generator-Evaluator framework makes it possible to ship complete, multi-feature applications by delegating the build-and-QA cycle to autonomous AI agents. You provide a one-line prompt such as 'build a retro game maker' and a quality rubric that reflects your taste. The Planner breaks the idea into sprints, the Generator builds one feature at a time, and the Evaluator stress-tests each feature with live tools. A 4–6 hour autonomous session produces production-grade output that would otherwise take days of manual coding.

Can a solo developer really ship a full app using AI agents?

Yes, but only if the agent system has adversarial evaluation built in. A single AI coding agent, even a powerful one, will self-certify half-finished features as complete. The Planner-Generator-Evaluator framework solves this by splitting the work into three roles with separate context windows: a Planner that decomposes your idea, a Generator that builds features one at a time, and an Evaluator that actively tests the running app using tools like Playwright MCP.
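
The separation of roles can be sketched as a small loop. This is a minimal illustration, not the framework's actual implementation: `build_loop`, `stub_generate`, `stub_evaluate`, and the `max_attempts` limit are hypothetical names, and the stubs stand in for real model calls. The point it shows is that the Generator never marks its own work as done; only the Evaluator's verdict counts, and only the Evaluator's feedback crosses between the two contexts.

```python
def build_loop(features, generate, evaluate, max_attempts=3):
    """Build each feature in order; a feature counts only once the evaluator passes it."""
    results = {}
    for feature in features:
        feedback = ""
        passed = False
        for _ in range(max_attempts):
            # The generator sees only the feature and the last critique.
            artifact = generate(feature, feedback)
            # The evaluator judges the artifact independently and returns a verdict.
            passed, feedback = evaluate(feature, artifact)
            if passed:
                break
        results[feature] = passed
    return results


def stub_generate(feature, feedback):
    # Stand-in for a model call: revises only when given a critique.
    return f"{feature} (revised)" if feedback else f"{feature} (first draft)"


def stub_evaluate(feature, artifact):
    # Stand-in for live testing: rejects every first draft.
    if "revised" in artifact:
        return True, ""
    return False, "empty state is broken"


results = build_loop(["login", "feed"], stub_generate, stub_evaluate)
print(results)
```

In a real harness the two callables would wrap separate model sessions, so the Evaluator cannot inherit the Generator's optimism about its own output.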

The key insight for solo developers: you are not the builder anymore. You are the harness designer and taste curator. Your job is to write down what 'good' looks like (the rubric) and read the agent transcripts afterward to tune the system.

How do I get started with my first Planner-Generator-Evaluator build?

You need four inputs to begin:

1. A one-line task prompt: Keep it intentionally vague. "Build a habit tracker with social features" is better than a detailed spec. The Planner handles decomposition.

2. A quality rubric: Write your opinionated criteria across 2–4 dimensions. For a consumer app, try Design, Originality, Craft, and Functionality, weighted toward Design and Originality because AI models already handle Functionality well.

3. Model selection: Pick your models for each role. If budget is tight, use your strongest model for the Evaluator and a cheaper model for the Generator.

4. Reference examples: Screenshots or code samples showing your taste. Label them 'this is good design' and 'this is AI slop.' These calibrate the Evaluator.
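
The four inputs above can be collected in one small structure before a run. The sketch below is an assumption about how you might organize them, not a format the framework prescribes: the class names, the specific weights, the file paths, and the model tier strings (`strong-tier`, `cheap-tier`) are all placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceExample:
    path: str   # screenshot or code sample (hypothetical path)
    label: str  # "good" or "slop"


@dataclass
class HarnessInputs:
    task_prompt: str
    rubric_weights: dict   # dimension name -> weight
    models: dict           # role -> model tier (placeholder names, not real model IDs)
    references: list = field(default_factory=list)

    def validate(self):
        """Return a list of problems; an empty list means the inputs look usable."""
        problems = []
        if not self.task_prompt.strip():
            problems.append("task prompt is empty")
        if not 2 <= len(self.rubric_weights) <= 4:
            problems.append("rubric should have 2-4 dimensions")
        if abs(sum(self.rubric_weights.values()) - 1.0) > 1e-9:
            problems.append("rubric weights should sum to 1.0")
        missing = {"planner", "generator", "evaluator"} - set(self.models)
        if missing:
            problems.append(f"no model chosen for: {sorted(missing)}")
        labels = {ref.label for ref in self.references}
        if not {"good", "slop"} <= labels:
            problems.append("provide at least one 'good' and one 'slop' reference")
        return problems


inputs = HarnessInputs(
    task_prompt="Build a habit tracker with social features",
    rubric_weights={"Design": 0.35, "Originality": 0.35, "Craft": 0.15, "Functionality": 0.15},
    models={"planner": "strong-tier", "generator": "cheap-tier", "evaluator": "strong-tier"},
    references=[
        ReferenceExample("refs/admired-dashboard.png", "good"),
        ReferenceExample("refs/purple-gradient.png", "slop"),
    ],
)
print(inputs.validate())
```

Running `validate()` before a multi-hour session is a cheap way to catch a missing role or a rubric with no negative examples.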

The Planner outputs a featurelist.json file and a progress file. Before any code is written, the Generator and Evaluator negotiate a contract: the criteria each feature will be tested against. Then the build loop runs autonomously, so you can walk away and come back to a completed progress file.
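
The source does not specify the schema of featurelist.json, so the following is only a guess at a workable shape: a list of features, each with a `status` field that only the Evaluator is allowed to flip. The function names and the example features are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


def write_feature_list(path: Path, features: List[str]) -> None:
    """Planner side: record every feature as pending."""
    state = {"features": [{"name": name, "status": "pending"} for name in features]}
    path.write_text(json.dumps(state, indent=2))


def next_pending(path: Path) -> Optional[str]:
    """Generator side: pick the first feature that has not passed yet."""
    state = json.loads(path.read_text())
    for feature in state["features"]:
        if feature["status"] == "pending":
            return feature["name"]
    return None


def mark_passed(path: Path, name: str) -> None:
    """Evaluator side: flip a feature to passed only after it survives testing."""
    state = json.loads(path.read_text())
    for feature in state["features"]:
        if feature["name"] == name:
            feature["status"] = "passed"
    path.write_text(json.dumps(state, indent=2))


workdir = Path(tempfile.mkdtemp())
feature_file = workdir / "featurelist.json"
write_feature_list(feature_file, ["sprite editor", "level editor", "playtest mode"])
mark_passed(feature_file, next_pending(feature_file))
print(next_pending(feature_file))  # the second feature is now first in line
```

Because every agent re-reads the file at the start of its turn, the state survives a fresh context window or a restarted session.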

What do I do when the AI-generated design looks like 'AI slop'?

This is the most common quality problem for solo developers using AI agents. Generic layouts, purple gradients, and visually undifferentiated output are the default aesthetic of uncalibrated AI generation.

The fix is rubric calibration. Create a rubric weighted heavily toward Design and Originality. Provide the Evaluator with few-shot screenshots: examples of designs you admire labeled 'good' and examples of generic AI output labeled 'bad.' The Evaluator takes Playwright screenshots of the Generator's work and scores against your rubric.
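
To make "weighted heavily toward Design and Originality" concrete, here is a small scoring sketch. The weights, the 0–10 scale, and the two sample score sets are illustrative assumptions, not values from the framework.

```python
def weighted_score(scores, weights):
    """Combine per-dimension scores (0-10) into one number using rubric weights."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"evaluator did not score: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


# Hypothetical weights that favor Design and Originality.
WEIGHTS = {"Design": 0.35, "Originality": 0.35, "Craft": 0.15, "Functionality": 0.15}

# A functional but generic build versus a distinctive one with rougher edges.
generic = {"Design": 5, "Originality": 2, "Craft": 7, "Functionality": 9}
distinctive = {"Design": 8, "Originality": 8, "Craft": 6, "Functionality": 7}

print(round(weighted_score(generic, WEIGHTS), 2))
print(round(weighted_score(distinctive, WEIGHTS), 2))
```

With these weights the generic build is penalized for low Originality even though its Functionality score is the highest in the table, which is the behavior the rubric is meant to enforce.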

Critically, if the Evaluator consistently scores Originality low across multiple rounds, the harness should discard the current design direction entirely and restart rather than iterate on the same failing aesthetic. After 5–15 rounds, the output converges toward the taste the rubric defines.
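
The discard-and-restart rule can be expressed as a simple check on the score history. The `threshold` and `patience` values below are assumptions chosen for illustration; the source says only that scores are "consistently low across multiple rounds."

```python
def should_restart(originality_history, threshold=4.0, patience=3):
    """True when the last `patience` rounds all scored below `threshold`."""
    if len(originality_history) < patience:
        return False  # not enough evidence yet to abandon the direction
    return all(score < threshold for score in originality_history[-patience:])


print(should_restart([3, 2, 3]))     # three low rounds in a row
print(should_restart([3, 5, 3]))     # one round cleared the bar, keep iterating
print(should_restart([6, 3, 2, 3]))  # early promise, then three low rounds
```

Requiring several consecutive low rounds keeps one harsh evaluation from throwing away a direction that is still improving.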

How do I debug when the agent session produces poor results?

Read the transcripts. This is not optional and there is no shortcut. Go through the full agent transcripts line by line after each run. Find every point where the Evaluator accepted something you would have rejected, or where the Generator made a decision that led to drift.

Common patterns to look for:

- The Evaluator found a bug but deferred it ('fix later') instead of blocking

- The Generator started patching a broken approach instead of restarting

- Contract criteria were too vague to produce actionable critique

- The Planner over-specified technical details that cascaded into errors

Update the Evaluator's system prompt and rubric to close each gap. Over 3–5 runs, your harness gets dramatically better at producing output matching your taste.
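
A short script can point you at suspicious turns before you read line by line; it does not replace reading the transcripts. The sketch below covers only the first pattern in the list (a bug found, then deferred). The transcript format, a list of `(role, text)` pairs, and the phrase list are assumptions; adapt both to whatever your harness actually logs.

```python
import re

# Phrases that suggest the Evaluator let something slide (hypothetical list).
DEFERRAL = re.compile(r"fix later|defer|follow-up|for now|todo", re.IGNORECASE)


def flag_deferrals(transcript):
    """Return indexes of Evaluator turns that mention a bug and also defer it."""
    flagged = []
    for index, (role, text) in enumerate(transcript):
        if role == "evaluator" and "bug" in text.lower() and DEFERRAL.search(text):
            flagged.append(index)
    return flagged


transcript = [
    ("generator", "Implemented the streak counter."),
    ("evaluator", "Found a bug: streak resets at midnight UTC. Can fix later; marking as pass."),
    ("evaluator", "Found a bug in the share dialog. Blocking until it is fixed."),
]
print(flag_deferrals(transcript))
```

Each flagged turn is a candidate for a new line in the Evaluator's system prompt, for example an explicit rule that any discovered bug blocks the feature.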

Start with a small project — a single-page app with 3–5 features — and iterate on your harness before attempting a full-stack build.

// FREQUENTLY ASKED QUESTIONS

How long does a typical Planner-Generator-Evaluator session take for a small app?

A small app with 5–8 features typically takes 2–4 hours of autonomous agent runtime. The Generator builds one feature per loop iteration, and the Evaluator adds evaluation overhead per feature. Discards and restarts add time but improve final quality. Your active time is much shorter — primarily writing the initial rubric (30–60 minutes) and reading transcripts afterward (30–60 minutes per run).

What's the cheapest way to run the three-agent harness as a solo developer?

Use a capable but cost-efficient model (Sonnet-class, for example) for the Generator, since it consumes the most tokens, and reserve your strongest model (Opus-class) for the Evaluator and Planner, which use fewer tokens but need the best judgment. Keep state in JSON files to avoid wasting tokens on context rot. As you tune the harness, the discard-and-restart rate drops, which further reduces the cost per completed project.

Do I need to know how to code to use this framework?

You need enough technical understanding to write meaningful rubric criteria, read agent transcripts, and set up the initial harness infrastructure (model API configuration, Playwright MCP setup, file system organization). You do not need to write the application code yourself — that's the Generator's job. However, the ability to read code helps you understand why the agent made specific decisions when debugging transcripts.
