How Do DevOps Engineers Build Agent Runtime and Coordination Infrastructure?

For DevOps and infrastructure engineers implementing agent tooling · Based on Lou Bichard's Software Factory Primitives Framework

// TL;DR

DevOps engineers building infrastructure for coding agents need to understand the four primitives — Runtime, Orchestration, Triggers, and Coordination — and which ones they are responsible for. Runtime means VM-isolated environments (not containers) for agent execution. Orchestration means scaling agents horizontally. Triggers mean webhook-driven agent activation. Coordination — the missing primitive — means building state machines or CLI gateways that gate agent progress through SDLC micro-steps. This framework tells you what to build and why containers are insufficient.

Why Aren't Containers Enough for Running Coding Agents?

Containers are the default answer for running workloads at scale, but they are insufficient for coding agent execution. Two problems emerge:

Security isolation: Containers share a kernel with the host. A coding agent that executes arbitrary code, installs dependencies, and accesses repositories needs a stronger isolation boundary. Container escapes are a well-documented attack vector. For agents with repository write access, this is an unacceptable risk.

Noisy-neighbour compute: On Kubernetes, agents competing for CPU and memory create unpredictable performance. A coding agent that hits a resource contention wall mid-task may produce partial or broken output. At fleet scale — hundreds of agents across hundreds of repos — this becomes a systemic reliability problem.

The Software Factory Primitives Framework prescribes full VM isolation for proper development tasks. Each agent gets its own virtual machine with dedicated compute, providing both security guarantees and performance predictability.

What Infrastructure Do the Four Primitives Require?

Map each primitive to infrastructure components:

Runtime: VM provisioning and lifecycle management. Each agent session gets a VM with the repository cloned, dependencies installed, and tooling available, and the VM is torn down once the task completes. Dev environment platforms that handle this abstraction are worth considering before building it yourself.

Orchestration: Horizontal scaling — spin up N agent VMs in response to demand, spin down when idle. This looks like an autoscaling group or a managed compute pool. You need fast VM boot times (seconds, not minutes) to keep agents responsive.

Triggers: Webhook receivers that listen for events — PR created, ticket moved, CVE published, schedule fired — and initiate agent sessions. Standard webhook infrastructure with queue-backed processing for reliability.
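As a rough illustration of queue-backed trigger handling, the sketch below validates an incoming event and enqueues an agent task instead of starting the agent inline. The event names, the `EVENT_TO_TASK` mapping, and the `handle_webhook` function are all hypothetical; a real receiver would sit behind an HTTP endpoint, verify the sender's signature, and use a durable queue rather than an in-process one.

```python
import json
import queue

# Hypothetical event names; real ones depend on your webhook provider.
EVENT_TO_TASK = {
    "pr_created": "review_pr",
    "ticket_moved": "implement_ticket",
    "cve_published": "patch_dependency",
    "schedule_fired": "scheduled_maintenance",
}


def handle_webhook(raw_body: bytes, task_queue: queue.Queue) -> int:
    """Validate an event and enqueue an agent task; return an HTTP-style status."""
    try:
        event = json.loads(raw_body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400
    if not isinstance(event, dict) or not isinstance(event.get("type"), str):
        return 400
    task = EVENT_TO_TASK.get(event["type"])
    if task is None:
        return 204  # unmapped event type: acknowledge and ignore
    # Enqueue only; a separate worker starts the agent session.
    task_queue.put({"task": task, "repo": event.get("repo")})
    return 202
```

Returning quickly and letting a worker drain the queue keeps the receiver responsive when many events arrive at once.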

Coordination: This is the missing primitive and the one most DevOps teams have not built. It requires:

- A state machine per task/repo that tracks micro-step completion

- Machine-checkable gates at each micro-step boundary

- A query interface (CLI or API) that agents call to verify "may I proceed?"

- Durable execution to survive agent crashes and restarts

- State aggregation for human oversight dashboards
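The requirements above can be sketched as a small per-task state machine. This is a minimal in-memory illustration, not a production design: the step names in `STEPS` and the `TaskState` class are assumptions made for the example, and durability and dashboards are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical micro-steps, in the order an agent must complete them.
STEPS = ["implement", "write_tests", "tests_pass", "lint", "open_pr"]


@dataclass
class TaskState:
    task_id: str
    completed: List[str] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        """Return the next unfinished micro-step, or None when all are done."""
        if len(self.completed) < len(STEPS):
            return STEPS[len(self.completed)]
        return None

    def may_proceed(self, step: str) -> bool:
        """Answer the agent's 'may I proceed?' query for a given step."""
        return step == self.next_step()

    def record_gate(self, step: str, gate_passed: bool) -> bool:
        """Advance only if the step is next in order and its gate passed."""
        if not self.may_proceed(step) or not gate_passed:
            return False
        self.completed.append(step)
        return True
```

The agent never writes to `completed` directly. Only the coordination layer calls `record_gate`, using the result of a check it ran itself.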

How Do You Implement Machine-Checkable Gates in the Coordination Layer?

Gates are the mechanism that prevents agents from skipping steps or claiming false completion. Each gate is an independent verification that the coordination layer runs — not something the agent self-reports.

Examples of machine-checkable gates:

| Micro-Step | Gate Check |
|---|---|
| Code implementation complete | Compiles without errors |
| Unit tests written | Test files exist, test count > 0 |
| Tests pass | Test runner exits 0 |
| Coverage adequate | Coverage report > threshold |
| Lint clean | Linter exits 0 |
| PR description complete | Required fields populated |
| Security scan | No new critical vulnerabilities |

The coordination layer executes these checks independently of the agent process. If a gate fails, the coordination layer can either route back to the agent with specific feedback or flag for human intervention.

Implement gates as small, stateless verification scripts that the coordination layer invokes. Keep them fast — gate checks should take seconds, not minutes.
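One way to picture such a script is a thin wrapper that runs a command in the task's checkout and reports pass or fail from the exit code. The sketch below is illustrative only: `run_gate`, the `EXAMPLE_GATES` mapping, and the chosen commands are assumptions, and a real setup would substitute its own test runner and linter.

```python
import subprocess
import sys
from typing import Sequence

# Hypothetical gate commands; replace with your project's own tools.
EXAMPLE_GATES = {
    "compiles": [sys.executable, "-m", "compileall", "-q", "."],
    "tests_pass": [sys.executable, "-m", "unittest", "discover", "-q"],
}


def run_gate(command: Sequence[str], cwd: str = ".", timeout_s: float = 60.0) -> dict:
    """Run one gate command and report whether it exited with status 0."""
    try:
        result = subprocess.run(
            list(command), cwd=cwd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A gate that runs too long is treated as a failure.
        return {"passed": False, "reason": "timeout"}
    except OSError as exc:
        # The command could not be started (for example, binary not found).
        return {"passed": False, "reason": f"could not run: {exc}"}
    return {
        "passed": result.returncode == 0,
        "reason": f"exit code {result.returncode}",
        "output": result.stdout[-2000:],  # tail of stdout, usable as agent feedback
    }
```

Because the function holds no state, the coordination layer can run it on any worker and feed the `reason` and `output` fields back to the agent when a gate fails.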

How Should DevOps Teams Handle Agent Credential Management?

Every agent VM needs repository access, and potentially access to package registries, APIs, and deployment targets. Credential management at scale requires:

- Short-lived credentials: Generate per-session tokens that expire when the VM is torn down

- Least-privilege access: Each agent gets only the repository access it needs for its specific task

- Audit logging: Every credential use is logged with agent ID, task ID, and timestamp

- Rotation: No long-lived secrets stored in agent environments

- Blast radius containment: If one agent VM is compromised, it cannot access other repos or escalate privileges

This is where VM isolation pays off — credential containment is cleaner when each agent runs in its own isolated environment rather than sharing a container runtime.
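To make the short-lived, least-privilege idea concrete, here is a minimal sketch of a per-session token scoped to one repository. `SessionToken`, `issue_token`, and `is_allowed` are hypothetical names. In practice the token would be minted by your identity provider or Git host rather than generated locally, and every check would also be written to an audit log.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SessionToken:
    value: str
    agent_id: str
    repo: str          # the only repository this token may access
    expires_at: float  # Unix timestamp after which the token is invalid


def issue_token(agent_id: str, repo: str, ttl_s: float = 900.0,
                now: Optional[float] = None) -> SessionToken:
    """Mint a random token bound to one agent, one repo, and a short lifetime."""
    issued = time.time() if now is None else now
    return SessionToken(secrets.token_urlsafe(32), agent_id, repo, issued + ttl_s)


def is_allowed(token: SessionToken, repo: str, now: Optional[float] = None) -> bool:
    """Allow access only to the token's own repo, and only before expiry."""
    current = time.time() if now is None else now
    return token.repo == repo and current < token.expires_at
```

Setting `ttl_s` to roughly the expected session length means a token leaked from a torn-down VM is already close to useless.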

Next step: Start by building the VM provisioning pipeline for agent runtimes — fast boot, pre-loaded tooling, automatic teardown. Then implement one coordination gate (test runner verification) as a standalone service that agents can query. This gives you the infrastructure foundation to expand the coordination layer incrementally.

// FREQUENTLY ASKED QUESTIONS

What VM provisioning setup works best for coding agents?

Use fast-boot VMs with pre-built images containing common development tooling (language runtimes, package managers, linters, test frameworks). Target boot times under 30 seconds. Implement automatic teardown after task completion or timeout. Consider dev environment platforms like Gitpod or Coder that abstract VM lifecycle management, or build on cloud VM APIs with custom provisioning scripts.
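Automatic teardown is easiest to guarantee when it is tied to the code path that runs the task. The sketch below assumes two hypothetical callables, `provision` and `teardown`, that wrap whatever VM API you use. It shows the lifecycle pattern only and makes no actual cloud calls.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def agent_vm(provision: Callable[[], str],
             teardown: Callable[[str], None]) -> Iterator[str]:
    """Provision a VM for one agent session and always tear it down afterwards."""
    vm_id = provision()
    try:
        yield vm_id
    finally:
        # Runs on normal completion and on errors, so VMs are not leaked.
        teardown(vm_id)
```

A caller would write `with agent_vm(create_vm, destroy_vm) as vm_id:` and run the agent inside the block. A separate timeout watchdog is still needed for sessions that hang rather than fail.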

How do I build the coordination layer as infrastructure?

Implement a durable state machine service that tracks micro-step completion per task. Expose a CLI or API endpoint that agents call to query gate status. Run gate verification scripts independently of the agent process. Store state durably so agent crashes do not lose progress. Aggregate state for a human oversight dashboard. Tools like Temporal or custom state machines work well as the execution substrate.
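For a sense of what "store state durably" can look like at its simplest, the sketch below records completed micro-steps in SQLite using Python's standard `sqlite3` module. The table layout and the function names are assumptions for illustration. A workflow engine such as Temporal would replace all of this with its own persistence.

```python
import sqlite3
from typing import Set


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store; pass a file path for real durability."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps ("
        "task_id TEXT NOT NULL, step TEXT NOT NULL, "
        "PRIMARY KEY (task_id, step))"
    )
    conn.commit()
    return conn


def mark_done(conn: sqlite3.Connection, task_id: str, step: str) -> None:
    """Record a passed gate; repeating the call is harmless (idempotent)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO steps (task_id, step) VALUES (?, ?)",
            (task_id, step),
        )


def completed_steps(conn: sqlite3.Connection, task_id: str) -> Set[str]:
    """Return every micro-step already completed for a task."""
    rows = conn.execute(
        "SELECT step FROM steps WHERE task_id = ?", (task_id,)
    ).fetchall()
    return {row[0] for row in rows}
```

After an agent crash, a restarted session can call `completed_steps` and resume from the first step that is not in the result. The same table can feed an oversight dashboard.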

How do I handle agent compute costs at fleet scale?

Use VM autoscaling with aggressive teardown — agents should not keep VMs running when idle. Implement queuing so agent tasks wait for available VMs rather than spinning up unbounded instances. Monitor per-task compute costs and set budgets. Spot or preemptible VMs can reduce costs for non-urgent fleet operations, with the coordination layer handling retries when VMs are reclaimed.
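Two of these ideas, a capped pool and retries after reclamation, can be sketched in a few lines. `desired_vm_count`, `run_with_retries`, and the `VMReclaimed` exception are hypothetical names. How a reclaimed VM is actually detected depends on your cloud provider and is not shown.

```python
class VMReclaimed(Exception):
    """Raised by a (hypothetical) task runner when a spot VM is reclaimed."""


def desired_vm_count(queued: int, running: int, max_vms: int = 50) -> int:
    """One VM per task, capped so excess tasks wait in the queue."""
    return min(max_vms, queued + running)


def run_with_retries(run_task, max_attempts: int = 3):
    """Re-run a task on a fresh VM when the previous VM was reclaimed."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_task(attempt)
        except VMReclaimed:
            if attempt == max_attempts:
                raise  # give up and surface the failure for human attention
```

With durable coordination state, each retry can resume from the last passed gate instead of starting the task over, which limits the compute wasted by a reclaimed VM.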