Frequently Asked Questions About Nick Nisi's Harness Engineering for AI Agents

23 answers covering everything from basics to advanced usage.

// Basics

What is the difference between a gate and a prompt instruction in AI agent pipelines?

A gate is code that structurally blocks the pipeline from advancing until required evidence is present or a required condition is satisfied. A prompt instruction is text that asks the agent to do something, which the agent can ignore, forget, or falsely claim to have followed. Gates are enforcement; instructions are suggestions. In Harness Engineering, every critical checkpoint is a gate, not an instruction, because agents can and do skip instructed steps but cannot bypass code-level enforcement.
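A minimal sketch of the distinction, assuming a hypothetical `Evidence` shape and hypothetical `testGate` and `advance` functions (none of these names come from Case itself). The gate is ordinary code that returns a verdict, and the harness refuses to move on when the verdict is negative, whatever the agent claims.

```typescript
// Hypothetical evidence record; the field name is an illustrative assumption.
interface Evidence {
  testOutputHash?: string; // hex SHA-256 of the captured test output
}

type GateResult = { ok: true } | { ok: false; reason: string };

// A gate is plain code: it inspects evidence and ignores the agent's claims.
function testGate(evidence: Evidence): GateResult {
  const hash = evidence.testOutputHash;
  if (hash === undefined) {
    return { ok: false, reason: "missing test output hash" };
  }
  if (!/^[0-9a-f]{64}$/.test(hash)) {
    return { ok: false, reason: "not a 64-character hex SHA-256 digest" };
  }
  return { ok: true };
}

// The harness, not the agent, decides whether the pipeline advances.
function advance(current: string, next: string, evidence: Evidence): string {
  return testGate(evidence).ok ? next : current;
}
```

Calling `advance("implementing", "verifying", {})` leaves the pipeline in `"implementing"`; no wording in the prompt can change that result.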

What is the Case implementation in Nick Nisi's Harness Engineering?

Case is Nick Nisi's specific harness implementation: a TypeScript state machine orchestrating five specialized agents — Implementer, Verifier, Reviewer, Closer, and Retrospective — with enforced gates between each stage. The Implementer attempts the task, the Verifier checks evidence artifacts, the Reviewer assesses code quality, the Closer packages deliverables with proof, and the Retrospective agent updates memory files from execution logs.

What does 'trust is a pass rate' mean?

It's Nick Nisi's formulation that human trust in an agent system must be grounded in measurable evidence, such as a pass rate, a delta score, or a verified hash, not in subjective confidence or the agent's self-assertion. You don't trust an agent because it sounds confident; you trust it because it passes, for example, 94% of your eval suite. If you can't measure trust numerically, you don't have it.

What is the JSONL transcript and what should it capture?

The JSONL transcript is a structured log of every action during an agent execution run. Each line is a JSON object capturing: the agent stage, the tool called, the input provided, the output received, timestamps, and state transitions. The Retrospective Agent reads this transcript to identify doom loops, skipped steps, and inefficiencies. It's the raw data that makes the retrospective memory loop possible. Without it, the harness cannot learn from its own execution history.
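One possible shape for a transcript line, sketched from the fields listed above. The `TranscriptEntry` interface, its field names, and the `toJsonl` and `parseJsonl` helpers are assumptions for illustration, not the format Case actually uses.

```typescript
// Hypothetical entry shape based on the fields named in the answer above.
interface TranscriptEntry {
  timestamp: string;   // ISO 8601
  stage: string;       // e.g. "implementer"
  tool: string | null; // null for entries that only record a transition
  input: unknown;
  output: unknown;
  transition?: { from: string; to: string };
}

// One JSON object per line. JSON.stringify escapes embedded newlines,
// so multi-line tool output still occupies a single line.
function toJsonl(entries: TranscriptEntry[]): string {
  return entries.map((entry) => JSON.stringify(entry) + "\n").join("");
}

// Reads the transcript back, skipping blank lines.
function parseJsonl(text: string): TranscriptEntry[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as TranscriptEntry);
}
```

Because each line is independent, the harness can append entries as the run proceeds and a retrospective step can read a partial transcript even if the run crashed.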

// How To

How do I create memory files for my AI agent harness?

Memory files are per-project and per-framework markdown documents maintained automatically by the Retrospective Agent. After each run, the Retrospective reads the JSONL execution transcript, extracts lessons (doom loops, failure patterns, successful strategies), and appends them to the appropriate memory file. You seed the initial memory files with known gotchas from your codebase. Over time, they accumulate institutional knowledge that prevents agents from repeating mistakes.
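The append step might look like the sketch below. The `Lesson` shape, the bullet format, and the `appendLesson` helper are hypothetical; the function works on the file's contents as a string so that reading and writing the file stays with the harness.

```typescript
// Hypothetical lesson shape; the categories mirror the ones named above.
interface Lesson {
  kind: "doom-loop" | "failure-pattern" | "successful-strategy";
  summary: string;
}

// Returns the memory file's new contents with the lesson appended as a bullet.
// A lesson already present verbatim is skipped so repeated runs do not bloat the file.
function appendLesson(memory: string, lesson: Lesson): string {
  const bullet = `- [${lesson.kind}] ${lesson.summary.trim()}`;
  const alreadyRecorded = memory.split("\n").some((line) => line === bullet);
  if (alreadyRecorded) return memory;
  const separator = memory === "" || memory.endsWith("\n") ? "" : "\n";
  return memory + separator + bullet + "\n";
}
```

A seeded file is simply the initial `memory` string: a heading and a few hand-written bullets in the same format.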

How do I write a provable Definition of Done for an AI agent task?

Specify a concrete, verifiable success condition that can be mechanically checked without trusting the agent's self-report. Examples: 'all tests pass and the SHA-256 hash of test output is attached,' 'Playwright video shows the button renders correctly before and after the fix,' or 'the diff shows only the auth middleware file was modified.' If you cannot define a mechanical verification method, redesign the task until you can. Ambiguous done-states let agents lie.
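As an illustration of the third example, a check that the diff touches only an allowed file can be a few lines of code. This sketch assumes the harness has already collected the changed paths (for instance from `git diff --name-only`); the function name and the decision to treat an empty diff as not done are assumptions.

```typescript
// Hypothetical Definition of Done check: every changed path must be on the
// allowlist. An empty diff counts as "not done" in this sketch.
function onlyAllowedFilesChanged(changedPaths: string[], allowed: string[]): boolean {
  const allowedSet = new Set(allowed);
  return changedPaths.length > 0 && changedPaths.every((path) => allowedSet.has(path));
}
```

The check needs no judgement from the agent: it either returns `true` for the collected diff or it does not.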

How do I run evals for my AI agent harness?

Build a structured test suite of defined tasks with known correct outcomes. Run the full suite before making any change to skills, gotchas, or harness logic to establish a baseline pass rate. After making changes, re-run the suite and compare. If pass rate drops, revert the change. Evals are the only reliable way to distinguish improvements from noise. Without them, you'll unknowingly ship degradations while assuming you're improving the system.
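The baseline comparison described above can be sketched as follows. `EvalResult`, `passRate`, and `decide` are hypothetical names, and the rule implemented is the one stated in the answer: revert when the pass rate drops.

```typescript
// Hypothetical result record for one eval task.
interface EvalResult {
  task: string;
  passed: boolean;
}

// Fraction of tasks that passed, between 0 and 1.
function passRate(results: EvalResult[]): number {
  if (results.length === 0) throw new Error("eval suite is empty");
  return results.filter((r) => r.passed).length / results.length;
}

// Compare a run made after a change against the baseline run made before it.
function decide(baseline: EvalResult[], candidate: EvalResult[]): "keep" | "revert" {
  return passRate(candidate) < passRate(baseline) ? "revert" : "keep";
}
```

With a small suite, a single flaky task can move the pass rate noticeably, so it may be worth running the suite more than once before deciding; that refinement is left out of the sketch.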

How do I build a state machine for AI agent orchestration?

Define discrete states for each agent stage: Implementing, Verifying, Reviewing, Closing, Retrospecting. Between each state, implement a transition gate — code that checks for a required condition (evidence artifact, reviewer approval) before allowing the state change. Use a structured format like TypeScript enums or a formal state machine library. Log every tool call and state transition to JSONL. The agent cannot self-transition; only the harness can advance the state.
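A compact sketch of such a state machine, using a TypeScript enum for the five states named above. The `RunContext` fields, the gate conditions, and the `Harness` class are illustrative assumptions rather than Case's actual code, and only state transitions are logged here (tool-call logging is omitted for brevity).

```typescript
enum Stage {
  Implementing = "Implementing",
  Verifying = "Verifying",
  Reviewing = "Reviewing",
  Closing = "Closing",
  Retrospecting = "Retrospecting",
}

// Artifacts the harness has collected so far; field names are assumptions.
interface RunContext {
  testOutputHash?: string;
  verifierPassed?: boolean;
  reviewerApproved?: boolean;
  deliverablePackaged?: boolean;
}

type Gate = (ctx: RunContext) => boolean;

// For each stage: where it leads and the gate guarding that transition.
// Retrospecting has no entry, so it is terminal in this sketch.
const transitions: Partial<Record<Stage, { next: Stage; gate: Gate }>> = {
  [Stage.Implementing]: { next: Stage.Verifying, gate: (c) => c.testOutputHash !== undefined },
  [Stage.Verifying]: { next: Stage.Reviewing, gate: (c) => c.verifierPassed === true },
  [Stage.Reviewing]: { next: Stage.Closing, gate: (c) => c.reviewerApproved === true },
  [Stage.Closing]: { next: Stage.Retrospecting, gate: (c) => c.deliverablePackaged === true },
};

class Harness {
  private stage: Stage = Stage.Implementing;
  readonly log: string[] = []; // one JSON line per state transition

  get current(): Stage {
    return this.stage;
  }

  // Only the harness calls this; agents never set the stage themselves.
  tryAdvance(ctx: RunContext): boolean {
    const transition = transitions[this.stage];
    if (transition === undefined || !transition.gate(ctx)) return false;
    this.log.push(
      JSON.stringify({ timestamp: new Date().toISOString(), from: this.stage, to: transition.next }),
    );
    this.stage = transition.next;
    return true;
  }
}
```

Keeping `stage` private is the point of the design: agent code can hand artifacts to the harness, but the only path to the next state runs through a gate.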

How often should I run evals on my agent harness?

Run evals before and after every change to skills, gotchas, prompts, or harness logic — no exceptions. This is the 'Measure, Don't Assume' principle. Also run evals periodically (weekly or after model updates) to detect drift from upstream model changes. If you change nothing and pass rates drop, the model provider may have updated the model. Evals are your only ground truth for whether the system is improving, stable, or degrading.

// Troubleshooting

My agent keeps skipping the test step even though I told it to run tests. What should I do?

Stop telling it and start enforcing it. Remove the prompt instruction to run tests entirely. Instead, add a harness gate that requires a test artifact file containing the SHA-256 hash of actual test output. The state machine checks that this hash exists and is valid before the Verifier can proceed. The agent now cannot advance without the artifact, making it structurally easier to run the tests than to skip them. This is the 'Enforce, Don't Instruct' principle in action.
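A sketch of that gate using Node's built-in `crypto` module. The `TestArtifact` shape and function names are hypothetical. The gate recomputes the digest itself instead of trusting the value the agent supplies.

```typescript
import { createHash } from "node:crypto";

// Hypothetical artifact the agent must hand to the harness.
interface TestArtifact {
  output: string; // raw captured test output
  sha256: string; // hex digest claimed for that output
}

function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Passes only if an artifact exists, contains output, and the claimed
// digest matches the digest the harness computes from that output.
function testArtifactGate(artifact: TestArtifact | undefined): boolean {
  if (artifact === undefined || artifact.output.trim() === "") return false;
  return sha256Hex(artifact.output) === artifact.sha256.toLowerCase();
}
```

A matching hash shows only that the output was not altered after it was hashed. For the artifact to count as proof that tests ran, this sketch assumes the harness itself runs the test command or captures its output, rather than accepting text written by the agent.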

My agent is stuck in a doom loop calling the same tool repeatedly. How do I fix it?

First, the Retrospective Agent should detect this pattern in the JSONL transcript (same tool called 3+ times with no state change). Then treat it as a harness bug: identify what context or gate was missing that caused the loop. Common fixes include adding a gotcha about the specific API the agent was misusing, adding a circuit-breaker gate that forces a different approach after N identical calls, or providing a memory file with the correct usage pattern from a previous successful run.
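The detection rule in parentheses above (same tool, three or more times, no state change) can be approximated as below. The `ToolCall` shape is an assumption, and "no state change" is interpreted here as consecutive calls made within the same stage.

```typescript
// Hypothetical, reduced view of a transcript entry.
interface ToolCall {
  stage: string;
  tool: string;
}

// Length of the longest run of consecutive calls to the same tool
// within the same stage.
function longestIdenticalRun(calls: ToolCall[]): number {
  let longest = 0;
  let run = 0;
  let previousKey: string | null = null;
  for (const call of calls) {
    const key = JSON.stringify([call.stage, call.tool]);
    run = key === previousKey ? run + 1 : 1;
    longest = Math.max(longest, run);
    previousKey = key;
  }
  return longest;
}

// Default threshold follows the "3+ times" rule of thumb above.
function isDoomLoop(calls: ToolCall[], threshold = 3): boolean {
  return longestIdenticalRun(calls) >= threshold;
}
```

The same function can back a circuit-breaker gate: the harness evaluates it on the calls made so far in the live run and forces a different approach once it returns `true`.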

I added more documentation to my agent's context but performance got worse. Why?

More tokens can actively degrade agent performance. Models have finite attention, and irrelevant context dilutes focus on the information that matters. The 'Measure, Don't Assume' principle addresses this directly: always run evals before and after adding content. Nick Nisi demonstrated that 553 lines of targeted gotchas outperform 10,000 lines of comprehensive docs. Delete any skill content that reduces pass rate. Guide with gotchas, don't prescribe with documentation dumps.

The agent completes the task successfully but the Retrospective still finds issues. Is that normal?

Yes, and it's by design. The Retrospective runs on every execution, success or failure, because even successful runs contain inefficiencies: unnecessary tool calls, suboptimal approaches, near-misses that happened to work. Harness Engineering explicitly lists skipping retrospectives on successful runs as a pitfall. Continuous improvement comes from analyzing all runs, not just failures. The Retrospective's job is to make future runs faster and more reliable, not just to fix broken ones.

What happens when the Verifier agent itself makes a mistake?

This is why evals exist at the system level. If the Verifier consistently passes bad work or rejects good work, your eval suite will catch the degraded pass rate. Treat Verifier errors as harness bugs: update the Verifier's evidence-checking logic, add clearer artifact format requirements, or add gotchas about verification edge cases. The Retrospective Agent can also flag Verifier inconsistencies if the Reviewer repeatedly catches issues the Verifier missed.

// Comparisons

How does Harness Engineering compare to LangChain or CrewAI for building agent pipelines?

LangChain and CrewAI are agent frameworks that provide abstractions for chaining LLM calls and tool usage. Harness Engineering is a methodology that can wrap any framework — including LangChain or CrewAI — with external enforcement. The key difference is philosophy: frameworks give agents capabilities; the harness constrains and verifies agent behavior through gates, evidence artifacts, and retrospective memory. You can implement Harness Engineering principles on top of any agent framework.

How is Harness Engineering different from just adding unit tests to my agent pipeline?

Unit tests verify your code works; Harness Engineering verifies your agent actually did the work and did it correctly. Unit tests run at development time; harness gates run at execution time as structural enforcement within the pipeline. The harness also includes retrospective memory (learning from failures), evidence artifacts (cryptographic proof), and gotcha files (targeted context). Unit tests are one possible evidence artifact within a harness, but the harness is the complete enforcement and learning system.

Can I use Harness Engineering with Claude, GPT-4, or any LLM?

Yes. Harness Engineering is model-agnostic because the enforcement happens outside the model. The harness is an external state machine that controls execution flow, requires evidence artifacts, and manages memory files regardless of which LLM powers the agent. The principle 'Enforce, Don't Instruct' works precisely because enforcement is structural, not prompt-dependent. You may need different gotcha files per model since different LLMs fail on different things — measure with evals to calibrate.

// Advanced

What's the minimum viable harness I can start with?

Start with three components: a state machine with at least Implementer → Verifier → Closer stages, one evidence artifact requirement (e.g., test output hash), and a basic retrospective that logs failures to a markdown file. You don't need all five agents or sophisticated JSONL analysis on day one. Add gates and memory complexity as your eval suite reveals specific failure patterns. The one non-negotiable requirement is that at least one gate structurally prevents advancement without proof.

How do I decide which gotchas to include in my agent's context?

Run your eval suite without any gotchas to establish a baseline. Analyze which tasks the agent fails and why — look for patterns of implicit contract violations, not general coding errors. Write gotchas only for those specific failure points. Re-run evals to confirm each gotcha improves pass rate. If a gotcha doesn't measurably help, delete it. Aim for under 600 lines total. The model already knows how to code; expose only what it cannot know about your product's hidden rules.
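The selection procedure above can be sketched as a filter over per-gotcha eval results. `GotchaTrial` and `selectGotchas` are hypothetical names, and the sketch assumes each gotcha was evaluated on its own against the no-gotcha baseline, which ignores interactions between gotchas.

```typescript
// Hypothetical record: the pass rate measured with one gotcha added.
interface GotchaTrial {
  name: string;
  lines: number;
  passRateWith: number; // between 0 and 1
}

// Keep only gotchas that beat the baseline, preferring the biggest gains,
// while the running total stays under the line budget.
function selectGotchas(baselinePassRate: number, trials: GotchaTrial[], lineBudget = 600): string[] {
  const helpful = trials
    .filter((t) => t.passRateWith > baselinePassRate)
    .sort((a, b) => b.passRateWith - a.passRateWith);
  const kept: string[] = [];
  let usedLines = 0;
  for (const trial of helpful) {
    if (usedLines + trial.lines >= lineBudget) continue; // would reach the budget
    kept.push(trial.name);
    usedLines += trial.lines;
  }
  return kept;
}
```

After selection, a final eval run with the whole kept set loaded together is still needed, since gotchas that help individually are not guaranteed to help in combination.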

How do I handle multi-repo tasks in Harness Engineering?

Load memory files and gotchas specific to each repo involved in the task. The harness should understand repo boundaries and apply per-repo context selectively. Define evidence artifacts that span the full task — for example, an integration test that exercises the interaction between repos. Gates should verify cross-repo consistency, not just per-repo correctness. The Retrospective Agent should tag lessons with the specific repo they apply to for accurate future context loading.
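Selective loading by repo tag could be as simple as the sketch below. `TaggedLesson` and `lessonsForTask` are hypothetical names, and the repo identifiers are whatever the harness already uses to name repositories.

```typescript
// Hypothetical lesson record tagged with the repo it applies to.
interface TaggedLesson {
  repo: string;
  text: string;
}

// Groups lessons by repo, keeping only repos involved in the current task.
function lessonsForTask(reposInTask: string[], lessons: TaggedLesson[]): Map<string, string[]> {
  const byRepo = new Map<string, string[]>();
  for (const repo of reposInTask) byRepo.set(repo, []);
  for (const lesson of lessons) {
    const bucket = byRepo.get(lesson.repo);
    if (bucket !== undefined) bucket.push(lesson.text);
  }
  return byRepo;
}
```

Lessons tagged for repos outside the task never reach the agent's context, which keeps the loaded memory small as the number of repos grows.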

Should I use Harness Engineering for one-shot simple tasks or only for complex pipelines?

Harness Engineering provides the most value for multi-step autonomous tasks where agents must complete a chain of actions without human intervention. For simple one-shot tasks with easy human verification (e.g., 'write a function that adds two numbers'), the overhead of gates and evidence artifacts may not be justified. The framework is designed for scenarios where agent hallucination, step-skipping, or inconsistent completion creates real risk. Scale your harness complexity to match task complexity.

How do I version and manage memory files as they grow over time?

Store memory files in version control alongside your harness code. The Retrospective Agent appends to them, and you should periodically review and prune entries that are no longer relevant (e.g., gotchas for deprecated APIs). Run evals after pruning to ensure you haven't removed valuable context. Some teams tag memory entries with dates and invalidation conditions. Keep memory files per-project and per-framework to prevent irrelevant context from being loaded into unrelated tasks.
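If entries carry dates and invalidation conditions, pruning can be partly mechanical. In this sketch the `MemoryEntry` fields, the condition strings, and the `triage` function are all hypothetical; entries whose condition has been met are pruned, and old entries are only flagged for human review, not deleted.

```typescript
// Hypothetical tagged memory entry.
interface MemoryEntry {
  text: string;
  added: string;        // ISO date, YYYY-MM-DD
  invalidWhen?: string; // condition tag, e.g. "legacy-api-removed"
}

interface Triage {
  keep: MemoryEntry[];
  prune: MemoryEntry[];
  review: MemoryEntry[];
}

// ISO dates in YYYY-MM-DD form sort correctly as plain strings.
function triage(entries: MemoryEntry[], metConditions: string[], reviewBefore: string): Triage {
  const met = new Set(metConditions);
  const result: Triage = { keep: [], prune: [], review: [] };
  for (const entry of entries) {
    if (entry.invalidWhen !== undefined && met.has(entry.invalidWhen)) {
      result.prune.push(entry);
    } else if (entry.added < reviewBefore) {
      result.review.push(entry);
    } else {
      result.keep.push(entry);
    }
  }
  return result;
}
```

The `prune` list is a proposal, not a verdict: re-running evals with those entries removed confirms whether dropping them was safe.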

Can Harness Engineering work for non-coding AI agent tasks like content generation?

The principles apply to any domain where agent output must be verified, but the evidence artifacts change. For content generation, evidence might be a plagiarism check score, a readability metric, or a structured rubric evaluation by a second agent. The core framework — state machine enforcement, gates, retrospective memory — is domain-agnostic. The challenge is defining mechanically verifiable success conditions for subjective outputs; if you can't verify it mechanically, Harness Engineering's value diminishes.