How Platform Engineers Build Agent Coordination at Scale
For platform engineering leads at mid-to-large companies · Based on Lou Bichard's Software Factory Primitives Framework
// TL;DR
Platform engineering leads managing coding agents across many repositories need the Software Factory Primitives Framework to diagnose infrastructure gaps and build a coordination layer. The framework reveals that Runtime, Orchestration, and Triggers are usually solved — Coordination is the missing piece. Use it to decompose your SDLC into gated micro-steps, select the Fleet or Events pattern for org-scale automation, enforce VM isolation for security, and apply Harness Engineering so your repos progressively improve agent reliability without constant human intervention.
Why Are My Coding Agents Creating Chaos Across Repositories?
Platform engineering teams often start with a promising setup — multiple coding agents, CI/CD pipelines, and GitHub or Linear for coordination — only to find that agents create overwhelming noise, skip critical steps, and require constant human intervention. The root cause is almost always the same: Coordination is the missing primitive.
The Software Factory Primitives Framework identifies four infrastructure components every agentic pipeline needs: Runtime, Orchestration, Triggers, and Coordination. As a platform engineering lead, your Runtime (VMs, containers) and Orchestration (Kubernetes, autoscaling) are likely solid. Your Triggers (webhooks, PR events) probably work. But Coordination — how agents gate progress, hand off tasks, and collaborate — is almost certainly ad-hoc or nonexistent.
The first step is to audit all four primitives against your current setup and mark each as solved, partial, or missing. This gives you a precise diagnosis instead of a vague sense that "agents aren't working."
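The audit can be recorded as a small data structure so the gaps are explicit rather than anecdotal. The following is a minimal sketch, not part of the framework itself: the `Status` enum, the `audit_gaps` helper, and the example assessment are all hypothetical names and values chosen for illustration.

```python
from enum import Enum


class Status(Enum):
    SOLVED = "solved"
    PARTIAL = "partial"
    MISSING = "missing"


# The four primitives named by the framework, in the order the article lists them.
PRIMITIVES = ("Runtime", "Orchestration", "Triggers", "Coordination")


def audit_gaps(audit):
    """Return the primitives that are not fully solved, in framework order."""
    unassessed = [p for p in PRIMITIVES if p not in audit]
    if unassessed:
        # Refuse to report gaps until every primitive has been assessed.
        raise ValueError(f"Unassessed primitives: {unassessed}")
    return [p for p in PRIMITIVES if audit[p] is not Status.SOLVED]


# Hypothetical assessment for illustration only.
example_audit = {
    "Runtime": Status.SOLVED,
    "Orchestration": Status.SOLVED,
    "Triggers": Status.PARTIAL,
    "Coordination": Status.MISSING,
}
gaps = audit_gaps(example_audit)
print(gaps)
```

Forcing an explicit status for every primitive keeps the audit from silently skipping the one nobody owns, which is typically Coordination.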
How Do I Coordinate Agents Across Hundreds of Repositories?
At fleet scale, you need the Fleet pattern: agents fan out across multiple repositories simultaneously, driven by schedules or triggers. Common use cases include CVE remediation across 500 repos, dependency updates, test coverage enforcement, or policy compliance.
The coordination layer for fleet operations must include:
1. Micro-step decomposition — Each automated workflow (e.g., CVE patch) is broken into explicit micro-steps: identify affected repo, locate vulnerable dependency, bump version, run tests, raise PR.
2. Machine-checkable gates — At each micro-step boundary, objective criteria determine pass/fail. Tests must pass. Lint must be clean. The agent does not self-certify completion.
3. State machine or workflow graph — A central representation of pipeline state across all repositories, enabling you to see which repos succeeded, which are blocked, and where human intervention is needed.
Do not use GitHub or Linear as this coordination layer. They are designed for human workflows, and fleet-scale agent activity buries the signal in noise.
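The three requirements above can be sketched together as a tiny state machine. This is an illustrative sketch under stated assumptions, not a reference implementation: the step names mirror the CVE patch example, while `RepoState`, `advance`, `fleet_summary`, and the stub gates are hypothetical. The agent's actual work for each step is elided; only the gate check at each boundary is modelled.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Micro-steps from the CVE patch example, in execution order.
STEPS = ["identify_repo", "locate_dependency", "bump_version", "run_tests", "raise_pr"]

# A gate is an objective check for one repo: True means pass, False means fail.
Gate = Callable[[str], bool]


@dataclass
class RepoState:
    repo: str
    completed: list = field(default_factory=list)
    blocked_at: Optional[str] = None

    @property
    def status(self) -> str:
        if self.blocked_at is not None:
            return "blocked"
        return "succeeded" if self.completed == STEPS else "in_progress"


def advance(state: RepoState, gates: dict) -> RepoState:
    """Walk the remaining micro-steps and stop at the first failing gate."""
    state.blocked_at = None  # a retry starts from the last completed step
    for step in STEPS[len(state.completed):]:
        if not gates[step](state.repo):
            state.blocked_at = step
            return state
        state.completed.append(step)
    return state


def fleet_summary(states: list) -> dict:
    """Group repos by status so blocked ones surface for human attention."""
    summary = {"succeeded": [], "blocked": [], "in_progress": []}
    for s in states:
        summary[s.status].append(s.repo)
    return summary


# Stub gates (assumption): everything passes except tests in "repo-b".
gates = {step: (lambda repo: True) for step in STEPS}
gates["run_tests"] = lambda repo: repo != "repo-b"

states = [advance(RepoState(r), gates) for r in ("repo-a", "repo-b")]
summary = fleet_summary(states)
print(summary)
```

Because the agent never writes to `completed` directly, progress is recorded only when a gate passes, and the summary shows which repos need a human without anyone reading per-repo threads.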
How Do I Ensure Security When Agents Have Write Access at Scale?
Fleet-scale agent automation dramatically increases your attack surface. A compromised agent could push malicious code across hundreds of repositories. The framework prescribes VM isolation as the baseline rather than containers: containers are not a bulletproof security boundary, and they create noisy-neighbour compute contention on shared Kubernetes clusters.
Beyond VM isolation, audit four things: what permissions each agent holds, which repositories the fleet can touch, whether agents can escalate privileges, and what monitoring detects anomalous behavior. Security is a prerequisite for moving the human further out of the loop. Without it, you cannot responsibly increase automation.
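Part of that audit can be automated as a scope check per agent. The sketch below is a hedged illustration: the `AgentScope` record, the `audit_scope` helper, and the permission names in `ESCALATION_RISK` are hypothetical and do not correspond to any particular platform's permission model.

```python
from dataclasses import dataclass

# Hypothetical permission names treated as privilege-escalation risks.
ESCALATION_RISK = frozenset({"admin", "manage_secrets", "bypass_branch_protection"})


@dataclass(frozen=True)
class AgentScope:
    agent_id: str
    allowed_repos: frozenset
    permissions: frozenset


def audit_scope(scope: AgentScope, fleet_repos: frozenset) -> list:
    """Return human-readable findings for one agent; an empty list means clean."""
    findings = []
    outside = scope.allowed_repos - fleet_repos
    if outside:
        findings.append(f"{scope.agent_id}: access outside fleet: {sorted(outside)}")
    risky = scope.permissions & ESCALATION_RISK
    if risky:
        findings.append(f"{scope.agent_id}: escalation-risk permissions: {sorted(risky)}")
    return findings


# Hypothetical agent and fleet for illustration only.
scope = AgentScope(
    agent_id="cve-bot",
    allowed_repos=frozenset({"repo-a", "billing-core"}),
    permissions=frozenset({"contents:write", "admin"}),
)
findings = audit_scope(scope, fleet_repos=frozenset({"repo-a", "repo-b"}))
for line in findings:
    print(line)
```

A check like this covers only the permission and repository questions; detecting anomalous behavior still needs runtime monitoring.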
How Do I Keep Agents on Track as They Execute Complex Workflows?
Harness Engineering is your ongoing practice: run agents through the pipeline, identify exactly where context rot causes them to drift or skip steps, and encode fixes back into the repository. This means updating agents.md with domain-specific rules, adding context files with architectural decisions, writing unit tests that catch the specific failure modes you observe, and adding skill definitions.
The repository itself becomes progressively smarter. Each Harness Engineering cycle reduces the failure rate and moves you closer to a true software factory where work flows to production autonomously.
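One way to find where agents drift is to tally which micro-step each failed run stopped at. This is a minimal sketch assuming a run log of `(repo, failed_step)` pairs; the log format, the `drift_hotspots` helper, and the sample entries are hypothetical.

```python
from collections import Counter


def drift_hotspots(run_log, top=3):
    """Count failures per micro-step and return the most frequent ones first."""
    return Counter(step for _repo, step in run_log).most_common(top)


# Hypothetical failed runs: (repository, micro-step where the gate failed).
run_log = [
    ("repo-a", "run_tests"),
    ("repo-b", "run_tests"),
    ("repo-c", "bump_version"),
]
hotspots = drift_hotspots(run_log)
print(hotspots)
```

The step at the top of the list is the first candidate for a new rule, context file, or unit test in the next Harness Engineering cycle.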
Next step: Run the four-primitive audit on your current agent infrastructure. Identify whether Coordination is your gap, then design a purpose-built coordination layer using the micro-step decomposition approach. Start with one high-value fleet workflow (e.g., CVE remediation) and expand from there.
// FREQUENTLY ASKED QUESTIONS
How do I audit my agent infrastructure as a platform engineer?
Assess each of the four primitives — Runtime, Orchestration, Triggers, Coordination — as solved, partial, or missing. Platform teams usually have strong Runtime and Orchestration but lack a purpose-built Coordination layer. Triggers may exist via CI/CD but aren't connected to agent lifecycle events. Mark the gaps clearly before building solutions.
Should I use containers or VMs for fleet-scale coding agents?
Use VMs. Containers are not a bulletproof security boundary and create noisy-neighbour compute contention on shared Kubernetes clusters. At fleet scale, where agents have write access to hundreds of repos and execute arbitrary commands, VM isolation is the baseline for security and reliability. Containers may suffice only for simple, stateless tasks that don't require full development environments.
How do I prevent agents from raising broken PRs across hundreds of repos?
Implement machine-checkable gates at each micro-step boundary in your coordination layer. Before a PR is raised, require that tests pass, lint is clean, and builds succeed — verified by the gate, not by the agent's self-report. This prevents sycophantic false completion signals and catches failures before they propagate across your fleet.
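A gate of this kind can be as simple as running the check commands and trusting only their exit codes. The sketch below is illustrative: `gate_passes` and `may_raise_pr` are hypothetical helpers, and the check commands shown are stand-ins for your real test, lint, and build commands.

```python
import subprocess
import sys


def gate_passes(commands, cwd="."):
    """Return True only if every check command exits with status 0."""
    for cmd in commands:
        result = subprocess.run(cmd, cwd=cwd, capture_output=True)
        if result.returncode != 0:
            return False
    return True


def may_raise_pr(agent_claims_done, commands, cwd="."):
    """Decide whether a PR may be raised.

    The agent's self-report is accepted as input but deliberately ignored:
    only the exit codes of the check commands count.
    """
    return gate_passes(commands, cwd)


# Stand-in checks (assumption): one command that succeeds and one that fails.
passing = [sys.executable, "-c", "import sys; sys.exit(0)"]
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]

print(may_raise_pr(True, [passing]))           # checks pass, PR allowed
print(may_raise_pr(True, [passing, failing]))  # agent says done, gate still blocks
```

In practice the command list would hold the repository's own test, lint, and build invocations, run inside the same isolated environment the agent used.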