How Platform Teams Build Reliable AI Agent Pipelines
For platform engineering teams building internal AI developer tools · Based on Nick Nisi's Harness Engineering for AI Agents
// TL;DR
Platform engineering teams building internal AI developer tools need agent pipelines that run reliably at scale without human babysitting. Harness Engineering gives you a state-machine architecture with five agent stages (Implementer, Verifier, Reviewer, Closer, Retrospective), hard gates that block advancement until the agent produces mechanically verifiable evidence such as SHA-256 hashed test output, and a retrospective memory loop that lets the system improve from its own runs. Use it when your agents must autonomously handle issues, PRs, and deployments across your organization's repos without hallucinating completion.
Why do platform teams need Harness Engineering for AI agents?
Platform teams are responsible for internal developer tools that must work reliably across the entire organization. When you deploy an AI agent pipeline that autonomously handles GitHub issues, PRs, or deployment tasks, failures don't just affect one developer — they affect every team consuming your platform.
The core problem is trust. When an agent says "tests pass" or "bug fixed," can you trust that claim? Harness Engineering answers this with structural enforcement: the agent cannot advance through the pipeline without producing mechanically verifiable evidence. Your platform's reliability no longer depends on prompt compliance, that is, on the agent choosing to follow its instructions.
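As a rough illustration of structural enforcement, the sketch below models the five stages as a state machine whose only transition function refuses to advance without evidence. It is not taken from the Case codebase: the names `Stage`, `Evidence`, and `advance`, and the choice of an artifact path plus a harness-recorded exit code as "evidence," are assumptions made for this example.

```typescript
// Hypothetical names throughout; a sketch, not the Case implementation.
type Stage =
  | "implementer" | "verifier" | "reviewer"
  | "closer" | "retrospective" | "done";

const ORDER: readonly Stage[] = [
  "implementer", "verifier", "reviewer", "closer", "retrospective", "done",
];

// Evidence is recorded by the harness, not reported by the agent.
interface Evidence {
  stage: Stage;         // stage that produced the artifact
  artifactPath: string; // where the harness stored it
  exitCode: number;     // exit code the harness observed
}

type GateResult =
  | { ok: true; next: Stage }
  | { ok: false; reason: string };

// The only way to move forward: every check must pass.
function advance(current: Stage, evidence: Evidence | undefined): GateResult {
  if (current === "done") {
    return { ok: false, reason: "pipeline already finished" };
  }
  if (!evidence) {
    return { ok: false, reason: `no evidence for ${current}` };
  }
  if (evidence.stage !== current) {
    return { ok: false, reason: `evidence is for ${evidence.stage}, not ${current}` };
  }
  if (evidence.artifactPath.trim() === "") {
    return { ok: false, reason: "artifact path is empty" };
  }
  if (evidence.exitCode !== 0) {
    return { ok: false, reason: `exit code ${evidence.exitCode}` };
  }
  const next = ORDER[ORDER.indexOf(current) + 1] ?? "done";
  return { ok: true, next };
}
```

Because `advance` is the single path between stages, an agent that merely asserts "tests pass" has no way to move the pipeline forward; only an artifact the harness itself recorded can.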
How do you architect a harness for multi-team, multi-repo environments?
Start with Nick Nisi's Case architecture: a TypeScript state machine orchestrating five agents with hard gates between each stage. For platform teams, the key architectural decisions are:
1. Per-repo memory files: Each repository gets its own markdown memory file maintained by the Retrospective Agent. When an agent works on repo X, only repo X's gotchas and lessons are loaded, never the entire organization's knowledge base. Loading more context than the task needs degrades agent performance.
2. Standardized evidence artifacts: Define org-wide evidence artifact standards. For backend services, require SHA-256 hashed test output. For frontend components, require Playwright before/after videos. For infrastructure changes, require diff verification against expected state. Standardization means every team's PRs arrive with proof attached.
3. Centralized eval suites: Build eval suites per repository and framework. Run them after any change to harness logic, skills, or gotchas. Nick Nisi demonstrated that 553 lines of targeted gotchas can outperform 10,000 lines of comprehensive docs, but you only know that holds for your repos if you measure it.
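One way to make the evidence standards in item 2 enforceable is to encode them as a typed schema the harness validates before a PR is opened. The sketch below is an assumption about how such a schema could look; the type `EvidenceArtifact`, its field names, and `validateArtifact` are hypothetical and not part of any published harness.

```typescript
// Hypothetical org-wide evidence schema; field names are illustrative.
type EvidenceArtifact =
  | { kind: "backend"; testOutputSha256: string }
  | { kind: "frontend"; beforeVideo: string; afterVideo: string }
  | { kind: "infrastructure"; expectedDiff: string; actualDiff: string };

// A SHA-256 digest rendered as 64 lowercase hex characters.
const SHA256_HEX = /^[0-9a-f]{64}$/;

// Returns a list of problems; an empty list means the artifact meets the standard.
function validateArtifact(a: EvidenceArtifact): string[] {
  const problems: string[] = [];
  switch (a.kind) {
    case "backend":
      if (!SHA256_HEX.test(a.testOutputSha256)) {
        problems.push("testOutputSha256 is not a 64-character hex digest");
      }
      break;
    case "frontend":
      if (a.beforeVideo === "") problems.push("missing before video");
      if (a.afterVideo === "") problems.push("missing after video");
      break;
    case "infrastructure":
      if (a.actualDiff !== a.expectedDiff) {
        problems.push("diff does not match expected state");
      }
      break;
  }
  return problems;
}
```

A discriminated union keeps each change type's requirements explicit, so adding a new kind of evidence forces a matching validation branch rather than silently passing.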
How do you prevent agents from hitting the same landmines across teams?
The Retrospective Memory Loop is your most powerful platform-level tool. After every agent run — success or failure — the Retrospective Agent reads the full JSONL execution transcript and updates per-project memory files with lessons learned.
For platform teams, extend this to shared framework-level memory files. If three different teams' agents all struggle with the same authentication middleware gotcha, that lesson should propagate to every repo using that middleware. The key is selective loading: surface only relevant gotchas, never dump the full memory corpus into context.
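Selective loading can be sketched as a filter over tagged memory entries. The example assumes each lesson is tagged as either repo-scoped or framework-scoped; the `Gotcha` and `Task` shapes and the function names are hypothetical, and a real harness would parse these entries out of the markdown memory files.

```typescript
// Hypothetical shapes; a real harness would parse these from memory files.
interface Gotcha {
  scope: "repo" | "framework";
  key: string;    // repo name or framework name, depending on scope
  lesson: string;
}

interface Task {
  repo: string;
  frameworks: string[]; // frameworks and middleware the target repo uses
}

// Keep only lessons for this repo or for a framework the repo actually uses.
function selectGotchas(memory: Gotcha[], task: Task): Gotcha[] {
  return memory.filter((g) =>
    g.scope === "repo"
      ? g.key === task.repo
      : task.frameworks.indexOf(g.key) !== -1
  );
}

// Render the selected lessons as the context block handed to the agent.
function buildContext(memory: Gotcha[], task: Task): string {
  return selectGotchas(memory, task)
    .map((g) => `- [${g.key}] ${g.lesson}`)
    .join("\n");
}
```

With this shape, a lesson about shared authentication middleware is written once at framework scope and reaches every repo that declares that middleware, while unrelated repos never see it.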
Apply the principle "Every Failure Is a Harness Bug" at the platform level. When an agent fails in any team's repo, don't patch the output — fix the gate, memory file, or gotcha that allowed the failure. The next run should structurally prevent the same mistake.
What does the rollout path look like for platform teams?
Start with a minimum viable harness: Implementer → Verifier → Closer with one evidence artifact gate. Deploy it on a single high-volume repo where agent failures are currently painful. Measure pass rate with evals. Add the Reviewer stage and Retrospective Agent once the basic pipeline is stable. Expand to additional repos only after memory files and gotchas are generating measurable improvements.
The goal is a self-improving system where your platform team's job shifts from fixing agent outputs to improving the harness environment. Every failure makes the system stronger. Trust is a pass rate — and your platform dashboard should display it.
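The pass rate behind that dashboard can be computed from plain eval results. The sketch below assumes each eval run is recorded as a repo name and a pass/fail flag; `EvalResult` and `passRateByRepo` are hypothetical names, not part of any specific eval framework.

```typescript
// Hypothetical record of one eval run.
interface EvalResult {
  repo: string;
  passed: boolean;
}

// Pass rate per repo as a fraction between 0 and 1.
function passRateByRepo(results: EvalResult[]): Record<string, number> {
  const totals: Record<string, { pass: number; total: number }> = {};
  for (const r of results) {
    const t = totals[r.repo] ?? (totals[r.repo] = { pass: 0, total: 0 });
    t.total += 1;
    if (r.passed) t.pass += 1;
  }
  const rates: Record<string, number> = {};
  for (const repo of Object.keys(totals)) {
    const t = totals[repo];
    if (t) rates[repo] = t.pass / t.total;
  }
  return rates;
}
```

Comparing these rates before and after a harness change gives you a concrete basis for deciding whether a new gotcha or gate is helping and whether a repo is ready for the next rollout step.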
Ready to implement? Start by mapping your highest-failure agent tasks, defining provable Definitions of Done for each, and building your first state machine gate.
// FREQUENTLY ASKED QUESTIONS
How do platform teams scale Harness Engineering across multiple repositories?
Use per-repo memory files and gotchas loaded selectively by the harness based on the target repository. Maintain shared framework-level memory for cross-cutting concerns like authentication or database patterns. Standardize evidence artifact formats org-wide so every PR arrives with the same proof structure. Run per-repo eval suites and aggregate pass rates into a platform dashboard.
How many gotcha files does a typical platform team maintain?
One gotcha file per repository plus shared files per framework or middleware used across repos. Keep each file under 600 lines and target the specific landmines agents reliably hit rather than rewriting comprehensive documentation. The Retrospective Agent appends new lessons automatically; your team periodically prunes outdated entries and validates the result with evals.
Can Harness Engineering integrate with existing CI/CD pipelines?
Yes. The harness state machine can be triggered by CI/CD events and its gates can integrate with existing CI checks. Evidence artifacts (test hashes, build logs) can be generated by your existing CI infrastructure and verified by the harness's Verifier gate. The harness wraps your agent execution; it doesn't replace your deployment pipeline.
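A minimal sketch of that handoff, assuming the CI job writes its test output along with a SHA-256 digest of it and the Verifier gate recomputes the digest before accepting the evidence. The `CiEvidence` shape and the function names are hypothetical; `createHash` is the standard Node.js `crypto` API.

```typescript
import { createHash } from "node:crypto";

// Hypothetical evidence record a CI job could emit next to its test output.
interface CiEvidence {
  testOutput: string;    // raw test runner output captured by CI
  claimedSha256: string; // hex digest CI computed over that output
}

function sha256Hex(data: string): string {
  return createHash("sha256").update(data, "utf8").digest("hex");
}

// The gate recomputes the digest instead of trusting the claimed value.
function verifierGate(evidence: CiEvidence): boolean {
  return sha256Hex(evidence.testOutput) === evidence.claimedSha256;
}
```

A matching digest only shows that the output was not altered between CI and the gate. A real Verifier gate would presumably also parse the output for failures before letting the pipeline advance.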