How Do Platform Teams Build Agent Coordination at Scale?

For platform engineering leads at mid-to-large companies · Based on Lou Bichard's Software Factory Primitives Framework

// TL;DR

Platform engineering leads running agent automation across many repositories can use the Software Factory Primitives Framework to diagnose which infrastructure primitive is blocking scale, which is almost always the Coordination layer. Use it to decompose SDLC stages into gated micro-steps, choose the fleet or events pattern for cross-repo operations, enforce VM isolation for security, and apply Harness Engineering so that your repositories keep improving agent reliability without human intervention.

Why Do Platform Teams Struggle to Scale Coding Agents Beyond a Single Repo?

Platform engineering teams are often the first to adopt coding agents — automating CVE remediation, enforcing test coverage, or applying policy changes across an organization. But what works in a single repo breaks down at fleet scale.

The root cause is almost always a missing coordination layer. Your agents have somewhere to run (Runtime ✓), you can spin them up and down (Orchestration ✓), and events can trigger them (Triggers ✓). But there is no mechanism for agents to gate their progress through SDLC micro-steps, hand off work, or report state in a way that humans can monitor without drowning in noise.

The Software Factory Primitives Framework gives you a diagnostic checklist: audit all four primitives, identify the gap, and build the coordination layer explicitly.
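The audit can be recorded as plain data. The following is a minimal sketch, not part of the framework itself: the `Status` enum, the `find_gaps` helper, and the lowercase primitive keys are hypothetical names chosen for illustration.

```python
from enum import Enum


class Status(Enum):
    SOLVED = "solved"
    PARTIAL = "partial"
    MISSING = "missing"


# The four primitives named in this article (keys are illustrative).
PRIMITIVES = ("runtime", "orchestration", "triggers", "coordination")


def find_gaps(audit: dict[str, Status]) -> list[str]:
    """Return the primitives that are not fully solved, in a fixed order."""
    unknown = set(audit) - set(PRIMITIVES)
    if unknown:
        raise ValueError(f"unknown primitives: {sorted(unknown)}")
    # A primitive that was never audited is treated as missing.
    return [p for p in PRIMITIVES if audit.get(p, Status.MISSING) is not Status.SOLVED]


# Example audit matching the situation described above.
audit = {
    "runtime": Status.SOLVED,
    "orchestration": Status.SOLVED,
    "triggers": Status.SOLVED,
    "coordination": Status.MISSING,
}
gaps = find_gaps(audit)
```

Here `gaps` contains only `"coordination"`, which is the result the article expects most teams to find.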

How Should Platform Teams Decompose the SDLC for Fleet-Scale Agent Operations?

The canonical five-step SDLC — plan, build, test, review, deploy — is too coarse for agents. When you tell an agent to "remediate this CVE," it may bump a dependency version without running tests, or raise a PR without verifying the fix compiles.

Decompose each operation into micro-steps with machine-checkable gates. For CVE remediation, the sequence might look like this:

1. Identify affected repository

2. Locate vulnerable dependency

3. Determine target version

4. Apply version bump

5. Run unit tests — gate: tests must pass

6. Run integration tests — gate: tests must pass

7. Verify no breaking changes — gate: lint + type check clean

8. Raise PR with structured description

Each gate is verified by the coordination layer, not by the agent self-reporting. This prevents the most common fleet-scale failure: agents raising hundreds of broken PRs simultaneously.
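One way to picture gate enforcement is a coordinator loop that runs each micro-step and then evaluates the gate itself. This is a sketch under stated assumptions: `MicroStep`, `run_pipeline`, and the stubbed actions and gate are hypothetical, and a real gate would run the test suite in a separate process rather than inspect a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MicroStep:
    name: str
    action: Callable[[dict], None]                 # work performed by the agent
    gate: Optional[Callable[[dict], bool]] = None  # check run by the coordinator


def run_pipeline(steps: list[MicroStep], ctx: dict) -> tuple[bool, list[str]]:
    """Run steps in order and stop at the first gate that fails."""
    completed: list[str] = []
    for step in steps:
        step.action(ctx)
        # The coordinator evaluates the gate; the agent's own claim is never consulted.
        if step.gate is not None and not step.gate(ctx):
            return False, completed
        completed.append(step.name)
    return True, completed


# Stubbed agent actions (illustrative only).
def apply_bump(ctx: dict) -> None:
    ctx["installed"] = ctx["target"]


def no_op(ctx: dict) -> None:
    pass


def raise_pr(ctx: dict) -> None:
    ctx["pr_raised"] = True


# Stand-in for the coordinator running the unit tests independently.
def unit_tests_pass(ctx: dict) -> bool:
    return ctx["installed"] not in ctx["known_bad"]


steps = [
    MicroStep("apply_version_bump", apply_bump),
    MicroStep("run_unit_tests", no_op, gate=unit_tests_pass),
    MicroStep("raise_pr", raise_pr),
]
ctx = {"target": "2.0.0", "known_bad": {"2.0.0"}}
ok, completed = run_pipeline(steps, ctx)
```

Because the unit-test gate fails in this example, the pipeline stops before `raise_pr` runs, so no broken PR is opened.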

Apply Harness Engineering by encoding CVE remediation procedures into agents.md files within each repository so agents receive repo-specific context at runtime.

What Coordination Layer Architecture Works for Cross-Repo Fleet Operations?

For fleet operations, implement a state machine per repository instance. Each repo gets its own coordination graph triggered by an event (CVE published, policy updated). The coordination layer tracks:

- Which repos have been processed

- Which micro-step each agent is on

- Which gates have passed or failed

- Where human intervention is needed
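A per-repository state machine covering the items above could be sketched as follows. The state names, the transition table, and the `RepoRun` class are assumptions made for illustration; the framework does not prescribe them.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    GATE_FAILED = "gate_failed"
    NEEDS_HUMAN = "needs_human"
    DONE = "done"


# Allowed transitions (hypothetical; adjust to your own workflow).
ALLOWED = {
    State.PENDING: {State.RUNNING},
    State.RUNNING: {State.RUNNING, State.GATE_FAILED, State.DONE},
    State.GATE_FAILED: {State.RUNNING, State.NEEDS_HUMAN},
    State.NEEDS_HUMAN: {State.RUNNING},
    State.DONE: set(),
}


class RepoRun:
    """Tracks one repository's progress through a triggered operation."""

    def __init__(self, repo: str, trigger: str) -> None:
        self.repo = repo
        self.trigger = trigger            # e.g. "cve_published"
        self.state = State.PENDING
        self.step: Optional[str] = None   # current micro-step
        self.gates: dict[str, bool] = {}  # gate name -> passed?

    def transition(self, new_state: State, step: Optional[str] = None) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.repo}: {self.state.value} -> {new_state.value} not allowed"
            )
        self.state = new_state
        if step is not None:
            self.step = step

    def record_gate(self, gate: str, passed: bool) -> None:
        self.gates[gate] = passed
        if not passed:
            self.transition(State.GATE_FAILED)


run = RepoRun("payments-api", trigger="cve_published")
run.transition(State.RUNNING, step="apply_version_bump")
run.transition(State.RUNNING, step="run_unit_tests")
run.record_gate("unit_tests", passed=False)
run.transition(State.NEEDS_HUMAN)  # escalate after the failed gate
```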

Avoid routing this through GitHub or Linear. These tools were designed for human coordination and will produce overwhelming noise at fleet scale. Instead, build a purpose-built dashboard that aggregates state: 342 repos processed, 12 gate failures requiring review, 3 repos with test infrastructure issues.
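The aggregation behind such a dashboard can be small. The sketch below assumes each repository run is reported as a record with a `state` field; the state labels (`done`, `gate_failed`, `infra_issue`, `running`) and the `summarize` function are hypothetical.

```python
from collections import Counter


def summarize(runs: list[dict]) -> dict:
    """Collapse per-repo state into the few numbers a human needs to see."""
    counts = Counter(r["state"] for r in runs)
    return {
        # A repo counts as processed once its run has reached an end state.
        "processed": counts["done"] + counts["gate_failed"] + counts["infra_issue"],
        "gate_failures": counts["gate_failed"],
        "infra_issues": counts["infra_issue"],
        "in_progress": counts["running"],
        # Only repos needing attention are listed individually.
        "needs_review": sorted(
            r["repo"] for r in runs if r["state"] in ("gate_failed", "infra_issue")
        ),
    }


runs = [
    {"repo": "payments-api", "state": "done"},
    {"repo": "billing-api", "state": "gate_failed"},
    {"repo": "legacy-batch", "state": "infra_issue"},
    {"repo": "search-service", "state": "running"},
]
summary = summarize(runs)
```

The design choice is to show counts by default and name only the repositories that need a human, which keeps the view readable as the fleet grows.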

For runtime, use VM isolation — not containers. At fleet scale, noisy-neighbour compute contention on Kubernetes causes unpredictable failures, and containers do not provide sufficient security isolation for agents modifying production dependencies.

What Security Considerations Matter Most at Fleet Scale?

Scaling agent automation increases the attack surface significantly. If an agent is compromised, it could modify dependencies across hundreds of repos. Implement:

- VM isolation as the baseline runtime

- Least-privilege repository access per agent

- Credential rotation on every agent session

- Full audit logging of all agent actions

- Gate verification that is independent of the agent process

Security is a prerequisite for moving humans further out of the loop, not an afterthought.
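Three of the controls above (least-privilege access, per-session credentials, and audit logging) can be illustrated together. This is a sketch only: `SessionCredential`, `issue_credential`, `authorize`, and the permission string `contents:write` are hypothetical, and a production system would obtain tokens from its identity provider rather than generate them locally.

```python
import json
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class SessionCredential:
    repo: str               # the single repository this session may touch
    permissions: frozenset  # the only actions allowed on it
    token: str
    expires_at: datetime


def issue_credential(repo: str, permissions: set, ttl_minutes: int = 30,
                     now: Optional[datetime] = None) -> SessionCredential:
    """Mint a fresh, short-lived credential for one agent session."""
    now = now or datetime.now(timezone.utc)
    return SessionCredential(
        repo=repo,
        permissions=frozenset(permissions),
        token=secrets.token_urlsafe(32),
        expires_at=now + timedelta(minutes=ttl_minutes),
    )


def authorize(cred: SessionCredential, repo: str, permission: str,
              audit_log: list, now: Optional[datetime] = None) -> bool:
    """Check one action against the credential and log the decision."""
    now = now or datetime.now(timezone.utc)
    allowed = (
        repo == cred.repo
        and permission in cred.permissions
        and now < cred.expires_at
    )
    # Every decision is logged, allowed or not; the token itself is never written.
    audit_log.append(json.dumps({
        "at": now.isoformat(),
        "repo": repo,
        "permission": permission,
        "allowed": allowed,
    }))
    return allowed


audit_log: list = []
cred = issue_credential("payments-api", {"contents:write"})
in_scope = authorize(cred, "payments-api", "contents:write", audit_log)
out_of_scope = authorize(cred, "billing-api", "contents:write", audit_log)
```

In this example the second call is denied because the credential is scoped to a different repository, and both decisions end up in `audit_log`.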

Next step: Audit your four primitives today. Map each one as solved, partial, or missing. If Coordination is your gap — and it almost certainly is — start by decomposing one fleet operation into micro-steps with machine-checkable gates before scaling further.

// FREQUENTLY ASKED QUESTIONS

How do I coordinate coding agents across hundreds of repositories?

Use the Fleet pattern with a purpose-built coordination layer. Implement a state machine per repository triggered by events (CVE, policy update). Track micro-step completion and gate pass/fail across all repos. Surface aggregated state to humans through a dedicated dashboard — not through GitHub PRs or Linear tickets, which create unmanageable noise at fleet scale.

Why do agents raise broken PRs when remediating CVEs at scale?

Agents skip verification steps when the SDLC is not decomposed into explicit micro-steps with machine-checkable gates. Without a gate requiring tests to pass before PR creation, agents will bump dependency versions and raise PRs without verifying the fix works. Implement independent gate checks at each micro-step boundary in your coordination layer.

Should platform teams use containers or VMs for agent execution?

Use VMs whenever agents do real development work, such as modifying code or dependencies. Containers are not a bulletproof security boundary, and on Kubernetes they are exposed to noisy-neighbour compute contention at scale. Full VM isolation provides the security guarantees and compute reliability needed when agents are changing code and dependencies across an organization's repositories.
