Nick Nisi Harness Engineering for AI Agents
Ship reliable AI-agent pipelines by replacing trust with cryptographic evidence, state-machine enforcement, and failure-driven memory — so your agents stop lying and start proving.
// TL;DR
Harness Engineering for AI Agents is a framework by Nick Nisi for building reliable AI-agent pipelines that replace trust with cryptographic evidence, state-machine enforcement, and failure-driven memory. Instead of prompting agents and hoping they comply, you wrap them in a harness: an external pipeline with hard gates that structurally prevent agents from advancing without verifiable proof of completion. Use it whenever you're building or auditing multi-step AI agent systems, when agents hallucinate completion or skip steps, or when you need to create AI-facing documentation that actually improves agent pass rates.
// When should you use Harness Engineering for AI Agents?
Use this skill whenever you are building or auditing an AI agent system that must autonomously complete multi-step engineering tasks. Also apply it when your agents are inconsistently completing steps, hallucinating completion, or when you are creating AI-facing documentation/skills for your own product.
// What inputs do you need before running the agent harness?
- task_source (required)
The triggering work item the agent must act on: a GitHub issue, PR, Linear ticket, Slack thread, or equivalent.
- target_codebase_context (required)
The repo(s), languages, and frameworks the agent will operate in.
- definition_of_done (required)
A concrete, verifiable success condition, e.g. 'tests pass', 'UI bug fixed', 'auth installed'. Must be provable without trusting the agent's self-report.
- known_gotchas
A list of the specific landmines, implicit contracts, or edge cases in this codebase or product that agents reliably get wrong.
- existing_memory_files
Any markdown memory files the harness has previously generated for this project/framework.
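The inputs above could be captured in a small typed record so the harness can refuse to start when a required one is missing. This is a minimal sketch: the field names mirror the list, but the `HarnessInputs` class and its `missing_required` method are hypothetical names chosen for illustration, not part of the original harness.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HarnessInputs:
    """Hypothetical container for the inputs listed above."""

    task_source: str               # reference to the issue, PR, ticket, or thread
    target_codebase_context: str   # repos, languages, frameworks
    definition_of_done: str        # must be mechanically verifiable
    known_gotchas: list = field(default_factory=list)
    existing_memory_files: list = field(default_factory=list)

    def missing_required(self) -> list:
        """Return the names of required inputs that are empty or blank."""
        required = ("task_source", "target_codebase_context", "definition_of_done")
        return [name for name in required if not getattr(self, name).strip()]
```

A harness built this way would check `missing_required()` before starting any agent and stop if the list is non-empty.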
// What are the core principles of Harness Engineering for AI Agents?
Enforce, Don't Instruct
Never rely on a prompt asking the agent to do something. Use a state machine or external pipeline gate to make it structurally impossible for the agent to advance without completing the required action. The agent can ignore instructions; it cannot bypass a hard gate.
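To make the difference between an instruction and a gate concrete, here is a minimal sketch of a guarded transition table. The source describes a TypeScript state machine; this illustration uses Python, and the stage names, evidence keys, and the `advance` function are assumptions made for the example.

```python
from enum import Enum


class Stage(Enum):
    IMPLEMENT = "implement"
    VERIFY = "verify"
    REVIEW = "review"
    CLOSE = "close"


# Each forward transition names the evidence key it requires (hypothetical keys).
TRANSITIONS = {
    Stage.IMPLEMENT: (Stage.VERIFY, "test_output_hash"),
    Stage.VERIFY: (Stage.REVIEW, "verifier_passed"),
    Stage.REVIEW: (Stage.CLOSE, "review_approved"),
}


class GateError(RuntimeError):
    """Raised when the pipeline is asked to advance without required evidence."""


def advance(current: Stage, evidence: dict) -> Stage:
    """Return the next stage, or raise if the required evidence is absent."""
    if current not in TRANSITIONS:
        raise GateError(f"{current.name} is a terminal stage")
    next_stage, required_key = TRANSITIONS[current]
    if not evidence.get(required_key):
        raise GateError(f"cannot leave {current.name}: missing {required_key!r}")
    return next_stage
```

The agent never calls `advance` itself. Only the harness does, so a text claim such as "done" has no path to changing the stage.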
Replace Trust with Evidence
Every claimed completion must be cryptographically or mechanically proven — not accepted on the agent's word. If it ran tests, the test output must be SHA-256 hashed into a verifiable artifact. If it fixed a UI bug, it must record a Playwright video before and after. Never waste review time on unproven work.
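One way to produce such an artifact is for the harness, rather than the agent, to run the test command and hash what it actually printed. The sketch below assumes that arrangement; `run_and_hash` and the shape of the returned record are illustrative names, not the original implementation.

```python
import hashlib
import subprocess


def run_and_hash(command: list) -> dict:
    """Run a command from the harness side and hash its combined output."""
    result = subprocess.run(command, capture_output=True, text=True)
    output = result.stdout + result.stderr
    return {
        "command": command,
        "exit_code": result.returncode,
        "output": output,
        # The digest lets a later gate confirm the stored output was not edited.
        "sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
    }
```

Because the harness runs the process itself, the digest covers output that really came from the test runner. A hash computed by the agent over text it wrote would prove nothing.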
Guide, Don't Prescribe
Do not feed the agent a comprehensive dump of your docs. Instead, surface only the specific gotchas — the landmines the model reliably hits in your product. The model already knows how to code; it just needs to know where the implicit contracts and failure points are.
Measure, Don't Assume
Run evals before and after every change to skills or prompts. More context does not mean better performance — it can actively degrade it. Trust is a pass rate, a hash, a delta score. If you are not measuring, you are adding noise.
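The before/after comparison reduces to a pass-rate delta. A minimal sketch, assuming each eval task yields a simple pass/fail boolean; the function names are hypothetical.

```python
def pass_rate(results: list) -> float:
    """Fraction of eval tasks that passed."""
    if not results:
        raise ValueError("no eval results to score")
    return sum(1 for passed in results if passed) / len(results)


def evaluate_change(before: list, after: list) -> tuple:
    """Return (keep, delta): keep the change only if the pass rate did not drop."""
    delta = pass_rate(after) - pass_rate(before)
    return delta >= 0, delta
```

For example, going from 2 of 4 passing to 3 of 4 gives a delta of 0.25 and the change is kept. A negative delta means revert.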
Every Failure Is a Harness Bug
When the agent makes a mistake, do not fix the agent's output — fix the harness so the harness corrects the agent. The system should be self-correcting; your job is to improve the environment, not patch individual outputs.
Retrospective Memory Loop
After every run, a retrospective agent reviews the full execution log (tool calls, loops, retries) and updates per-project markdown memory files with lessons learned. This prevents the agent from hitting the same landmine twice and enables the harness to improve automatically over time.
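The write side of this loop can be as simple as appending a bullet to a per-project markdown file. The sketch below assumes one file per project named `<project>.md`; that naming scheme and the `record_lesson` function are assumptions for illustration.

```python
from pathlib import Path


def record_lesson(memory_dir: Path, project: str, lesson: str) -> Path:
    """Append a lesson to the project's memory file, skipping exact duplicates."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    memory_file = memory_dir / f"{project}.md"
    if memory_file.exists():
        content = memory_file.read_text(encoding="utf-8")
    else:
        content = f"# Lessons for {project}\n\n"
    bullet = f"- {lesson.strip()}"
    if bullet not in content.splitlines():
        content += bullet + "\n"
    memory_file.write_text(content, encoding="utf-8")
    return memory_file
```

Skipping duplicates keeps the file from growing every time the same lesson is rediscovered, which matters because longer context is not automatically better.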
// How do you apply Harness Engineering for AI Agents step by step?
- 1
Identify and frame the task from its source artifact
Point the harness at the task source (issue, ticket, Slack thread). The harness — not you — gathers the required context. Do not manually brief the agent; automate context ingestion so setup time is not your bottleneck.
- 2
Define a provable Definition of Done before execution starts
Specify the evidence artifact required: test output hash, Playwright before/after video, diff of changed files, etc. If the success condition cannot be mechanically verified, redesign it until it can. Ambiguous done-states allow agents to lie.
- 3
Load relevant gotcha memory files for the target framework/project
Pull only the markdown memory files relevant to the current project and framework. Do not load the entire knowledge base. Irrelevant context degrades performance — send only targeted, landmine-specific guidance.
- 4
Run the Implementer agent inside the state machine
The Implementer attempts the task inside a controlled loop. It cannot advance the pipeline itself, and its own report of completion is never accepted as proof. All tool calls and outputs are logged to a structured transcript (e.g. JSONL) for the Retrospective agent.
- 5
Gate on the Verifier before any review
The state machine enforces that the Verifier must confirm the implementation before the Reviewer can run. This is a hard gate — not a prompt instruction. The Verifier checks the cryptographic/mechanical evidence artifact defined in Step 2.
- 6
Run the Reviewer and loop back to Implementer if issues found
The Reviewer checks code quality. Any issues trigger a mandatory return to the Implementer: the state machine blocks forward progress until the Reviewer is satisfied. The loop repeats until the Reviewer approves; the agent cannot escape it by asserting that it is done.
- 7
Run the Closer to attach evidence to the deliverable
The Closer only activates once the Reviewer clears the work. Its job is to package and attach the evidence artifact (video, hash, test output) to the PR or deliverable. No evidence = no PR. Human reviewers should refuse to look at output without this evidence attached.
- 8
Run the Retrospective agent on the full execution log
The Retrospective agent reads the full JSONL transcript of the run. It identifies doom loops (same tool called 3+ times with no change), skipped steps, and repeated mistakes. It writes lessons to the appropriate per-project or per-framework markdown memory file. This is non-optional — it runs every time, success or failure.
- 9
Treat every failure as a harness bug and fix the harness
Do not manually patch the agent's output. Identify which gate, memory file, or gotcha was missing that allowed the failure, then update the harness accordingly. The next run should structurally prevent the same mistake — not just hope the agent remembers your feedback.
- 10
Re-run evals after any change to skills, gotchas, or harness logic
Use an eval suite to measure pass rate before and after the change. If performance drops, revert. More content in skills is not better — 553 lines of targeted gotchas can outperform 10,000 lines of comprehensive docs. Delete anything that reduces pass rate.
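Steps 4 through 8 above can be sketched as a single orchestration loop. The source describes a TypeScript state machine; this is a simplified Python illustration in which every stage is a plain callable. The `run_pipeline` signature and the `max_rounds` cap are assumptions; the source describes the loop but does not specify a limit.

```python
from typing import Callable, Optional


def run_pipeline(
    implement: Callable[[int], str],
    verify: Callable[[str], Optional[str]],
    review: Callable[[str], bool],
    close: Callable[[str, str], None],
    retrospect: Callable[[list], None],
    max_rounds: int = 3,  # assumed safety cap, not from the source
) -> dict:
    """Run Implementer -> Verifier -> Reviewer -> Closer, then Retrospective."""
    transcript: list = []
    outcome = {"closed": False, "rounds": 0, "evidence": None}
    try:
        for round_number in range(1, max_rounds + 1):
            outcome["rounds"] = round_number
            work = implement(round_number)
            transcript.append({"stage": "implementer", "round": round_number})

            # Hard gate: no evidence from the Verifier means no review.
            evidence = verify(work)
            transcript.append({"stage": "verifier", "round": round_number,
                               "passed": evidence is not None})
            if evidence is None:
                continue

            # Reviewer rejection sends the work back to the Implementer.
            approved = review(work)
            transcript.append({"stage": "reviewer", "round": round_number,
                               "approved": approved})
            if not approved:
                continue

            close(work, evidence)
            transcript.append({"stage": "closer", "round": round_number})
            outcome.update(closed=True, evidence=evidence)
            break
    finally:
        # The Retrospective runs on every run, success or failure.
        retrospect(transcript)
    return outcome
```

The `finally` block is the design choice worth noting: it keeps the retrospective from being skipped even when a stage raises or the round cap is reached.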
// What does Harness Engineering look like in real-world scenarios?
An agent is tasked with fixing a UI regression in a React component across multiple repos. The developer wants to verify the fix before reviewing the code.
Apply 'Replace Trust with Evidence': require the agent to use a headless browser tool to record a video of the broken behavior before the fix and the working behavior after. The Closer attaches both videos to the PR. The human reviewer only looks at the PR once the videos prove the fix — otherwise the agent is sent back to the Implementer gate.
A developer is building AI-facing skills/docs for a third-party SDK that has implicit framework contracts agents consistently violate.
Apply 'Guide, Don't Prescribe' and 'Measure, Don't Assume': instead of converting the entire SDK docs into skills, run evals to identify which specific behaviors the model gets wrong. Write a targeted gotchas file (aim for under 600 lines) covering only those failure points. Re-run evals to confirm the gotchas file increases pass rate without introducing new failures. Delete any skill content that reduces pass rate.
An agent running tests keeps advancing in the pipeline without actually executing them.
Apply 'Enforce, Don't Instruct' and 'Replace Trust with Evidence': remove the prompt instruction to run tests. Instead, have the harness require a test artifact file whose content is the SHA-256 hash of the actual test output. The state machine gate checks that this hash exists and is valid before the Verifier can proceed. It is now structurally easier to run the tests than to fabricate the artifact.
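A gate of this kind might check the artifact as follows. The sketch assumes the harness has captured the raw test output to a log file of its own, so the gate can recompute the digest instead of trusting the artifact's content; the file layout and the `artifact_matches_log` name are hypothetical.

```python
import hashlib
import re
from pathlib import Path

SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def artifact_matches_log(artifact_path: Path, test_log_path: Path) -> bool:
    """Return True only if the artifact holds the SHA-256 of the captured log."""
    if not artifact_path.is_file() or not test_log_path.is_file():
        return False
    claimed = artifact_path.read_text(encoding="utf-8").strip()
    if not SHA256_HEX.fullmatch(claimed):
        return False  # present but not a well-formed lowercase hex digest
    actual = hashlib.sha256(test_log_path.read_bytes()).hexdigest()
    return claimed == actual
```

A missing file, a free-text claim, or a digest of some other content all fail the same way, so the pipeline stays at the current stage.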
// What mistakes should you avoid when implementing Harness Engineering?
- Instructing the agent to do something instead of enforcing it through a state machine gate — agents will skip, forget, or lie about completing instructed steps.
- Accepting the agent's self-report of completion as truth — always require a mechanically verifiable evidence artifact, never a text claim.
- Adding more context (longer skills, comprehensive docs) assuming it improves performance — more tokens can actively degrade results; always measure with evals before and after.
- Manually fixing the agent's mistakes on a per-run basis instead of fixing the harness — this creates no lasting improvement and keeps you as the bottleneck.
- Skipping the retrospective step on successful runs — the retrospective runs every time, not just on failures, to continuously improve memory files.
- Building skills by converting comprehensive documentation wholesale — focus exclusively on the gotchas: the specific landmines agents reliably hit in your product.
- Assuming the model does not already know how to code in your domain. It usually does; your job is to expose the product-specific implicit contracts it cannot know, not to re-teach coding fundamentals.
- Never running evals — without measurement, you cannot distinguish between improvements and noise, and you will unknowingly ship degradations.
// What are the key terms in Harness Engineering for AI Agents?
- Harness
- The external pipeline, state machine, and tooling environment that wraps and controls agent execution. The harness enforces gates, manages memory, and logs transcripts — it is what you improve when agents fail, not the agent's individual output. Concept attributed to Ryan Leuppolo's Harness Engineering.
- Case
- The specific harness implementation described by Nick Nisi: a TypeScript state machine orchestrating five agents (Implementer, Verifier, Reviewer, Closer, Retrospective) with enforced gates between each stage.
- Gates
- Hard enforcement checkpoints between agent stages in the state machine. A gate is not a prompt instruction — it is code that structurally blocks the pipeline from advancing until the required evidence or condition is satisfied.
- Gotchas
- A targeted, minimal list of the specific landmines, implicit contracts, and common failure points that AI agents reliably get wrong in a particular product or codebase. Gotchas replace comprehensive documentation as the preferred input for agent skills.
- Evidence Artifact
- A mechanically verifiable proof that an agent completed a required action, e.g. a SHA-256 hash of test output, a Playwright before/after video, or a structured diff. A hash only shows that content is unchanged, not where it came from, so the harness should capture or recompute the artifact itself. The goal is that doing the work is easier than fabricating the evidence.
- Retrospective Agent
- The final stage of the harness that reads the full execution transcript (JSONL logs) of a completed run, identifies inefficiencies and repeated mistakes, and writes lessons to per-project markdown memory files so future runs avoid the same failures.
- Memory Files
- Per-project and per-framework markdown files maintained by the Retrospective Agent that store accumulated lessons, gotchas, and hints. Loaded at the start of relevant tasks to give the agent context it has earned through prior failures.
- Doom Loop
- A failure pattern where an agent calls the same tool three or more times in succession with no meaningful change in state, indicating it is stuck. Detected by the Retrospective Agent in execution logs.
- Evals
- A structured test suite run against the agent system to measure pass rate on defined tasks, with and without specific skills or context loaded. The mechanism by which 'Measure, Don't Assume' is operationalised — the only reliable way to know whether a change improved or degraded performance.
- Trust is a Pass Rate
- Nick Nisi's formulation that human trust in an agent system must be grounded in a measurable score (pass rate, delta score, hash verification) — not subjective confidence or the agent's self-assertion.
// Frequently asked questions
What is Harness Engineering for AI Agents?
Harness Engineering for AI Agents is a framework for building reliable AI-agent pipelines by wrapping agents in an external state machine (the harness) that enforces gates, requires cryptographic evidence of task completion, and uses retrospective memory to learn from every run. Instead of trusting agent self-reports, the harness structurally prevents agents from advancing without mechanically verifiable proof — like SHA-256 hashed test output or Playwright before/after videos.
What is a harness in the context of AI agents?
A harness is the external pipeline, state machine, and tooling environment that wraps and controls agent execution. It enforces hard gates between stages, manages per-project memory files, and logs full execution transcripts. When an agent fails, you fix the harness — not the agent's individual output. The concept is attributed to Ryan Leuppolo's Harness Engineering approach and implemented by Nick Nisi as a TypeScript state machine orchestrating five specialized agents.
How do I set up a harness for my AI agent pipeline?
Start by defining your task source (GitHub issue, Linear ticket, etc.) and a provable Definition of Done with a concrete evidence artifact like a test output hash or video recording. Then build a state machine with five agent stages (Implementer, Verifier, Reviewer, Closer, and Retrospective) and a hard gate between each pair of consecutive stages. Load only targeted gotcha memory files, not comprehensive docs. Every run produces a JSONL transcript that the Retrospective agent uses to update memory files.
How do I stop AI agents from hallucinating task completion?
Replace prompt instructions with structural enforcement. Instead of telling the agent to run tests, require a verifiable evidence artifact — such as the SHA-256 hash of actual test output — that the state machine gate checks before allowing the pipeline to advance. The agent cannot self-report completion; it must produce mechanical proof. This makes it structurally easier to actually do the work than to fabricate the artifact.
How does Harness Engineering compare to just writing better prompts for AI agents?
Better prompts are suggestions agents can ignore, skip, or hallucinate compliance with. Harness Engineering replaces prompt-based trust with structural enforcement — hard gates in a state machine that block pipeline progress without verifiable evidence. Prompts say 'please run the tests'; harnesses make it impossible to advance without a cryptographic hash of actual test output. The difference is like asking someone to lock the door versus installing a lock that won't open without a key.
When should I use Harness Engineering instead of a simpler agent setup?
Use Harness Engineering whenever your AI agent must autonomously complete multi-step engineering tasks, when agents inconsistently complete steps or hallucinate completion, or when you're creating AI-facing documentation for your product. If your agent pipeline is a single-shot task with easy human verification, simpler setups may suffice. But any pipeline where you need to trust the output without reviewing every line benefits from evidence-based gates and retrospective memory.
What are gotchas in Harness Engineering and how do I write them?
Gotchas are a targeted, minimal list of the specific landmines, implicit contracts, and common failure points that AI agents reliably get wrong in your codebase or product. To write them, run evals to identify which behaviors the model fails on, then document only those failure points — aim for under 600 lines. Do not convert your entire documentation; the model already knows how to code. Gotchas expose what it cannot know: your product-specific implicit contracts.
What results can I expect from implementing Harness Engineering for AI Agents?
You can expect measurably higher agent pass rates on multi-step tasks, elimination of hallucinated completions, and a self-improving system that gets better with each run. Human review time drops because reviewers only examine work with attached evidence artifacts. The retrospective memory loop prevents agents from hitting the same failure twice. Nick Nisi demonstrated that 553 lines of targeted gotchas can outperform 10,000 lines of comprehensive docs when measured by eval pass rates.
What is a doom loop in AI agent execution?
A doom loop is a failure pattern where an agent calls the same tool three or more times in succession with no meaningful change in state, indicating it is stuck. The Retrospective Agent detects doom loops by analyzing the full JSONL execution transcript after each run. Once identified, the harness is updated — typically with a new gotcha or gate — so future runs avoid the same stuck pattern.
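Detection over a JSONL transcript could look like the sketch below. It assumes each line is a JSON object with `tool`, `args`, and `result` fields, and it treats "no meaningful change in state" as an identical call returning an identical result. Both the field names and that interpretation are assumptions; the source does not specify the transcript schema.

```python
import json


def find_doom_loops(jsonl_text: str, threshold: int = 3) -> list:
    """Report runs of `threshold` or more identical consecutive tool calls."""
    events = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    loops = []
    previous_key = None
    streak = 0
    start_index = 0
    for index, event in enumerate(events):
        # Same tool, same arguments, same result counts as "no change".
        key = json.dumps(
            [event.get("tool"), event.get("args"), event.get("result")],
            sort_keys=True,
        )
        if key == previous_key:
            streak += 1
        else:
            previous_key, streak, start_index = key, 1, index
        if streak == threshold:  # record each run once, when it first qualifies
            loops.append({"tool": event.get("tool"), "first_call": start_index})
    return loops
```

Each reported loop carries the index of its first call, which gives the Retrospective agent a place in the transcript to inspect when writing the lesson.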
How does the retrospective memory loop work in Harness Engineering?
After every run — success or failure — the Retrospective Agent reads the full JSONL execution transcript, identifying doom loops, skipped steps, and repeated mistakes. It writes lessons to per-project and per-framework markdown memory files. These memory files are selectively loaded at the start of future runs to give agents context earned through prior failures. This creates a self-improving system without manual intervention.
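The selective-loading half of the loop can be sketched as reading only the files that match the current project and framework. The `<name>.md` file naming and the `select_memory` function are assumptions for illustration.

```python
from pathlib import Path


def select_memory(memory_dir: Path, project: str, framework: str) -> str:
    """Concatenate only the memory files for this project and framework."""
    sections = []
    for name in (project, framework):
        candidate = memory_dir / f"{name}.md"
        if candidate.is_file():
            sections.append(candidate.read_text(encoding="utf-8").strip())
    return "\n\n".join(sections)
```

Memory files for other projects and frameworks in the same directory are never read, so they cannot dilute the agent's context.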
What evidence artifacts should I require from AI agents?
Evidence artifacts must be mechanically verifiable proofs that the agent actually completed the required action. Common examples include SHA-256 hashes of test output, Playwright before/after videos for UI fixes, structured diffs of changed files, and build logs. The key requirement is that the artifact cannot be fabricated without actually doing the work. The Closer agent attaches these artifacts to the PR or deliverable — no evidence means no PR.