How DevOps Engineers Build Reliable Agent Automation

For DevOps and SRE engineers automating with AI agents · Based on Lou Bichard's Software Factory Primitives Framework

// TL;DR

DevOps and SRE engineers adopting coding agents for infrastructure automation need the Software Factory Primitives Framework to move from ad-hoc agent scripts to a reliable, gated pipeline. The framework prescribes the Events pattern for trigger-driven automation (alerts, incidents, CVEs), VM isolation for security-sensitive infrastructure tasks, micro-step decomposition with machine-checkable gates to prevent agents from making unchecked changes, and Harness Engineering to encode runbook knowledge into repos so agents improve over time.

Why Do My Automated Agents Keep Making Unchecked Infrastructure Changes?

You set up an agent to respond to alerts or remediate issues, and it worked — until it skipped a validation step and pushed a bad config to production. The problem isn't the agent's capability. The problem is the absence of a coordination layer with gates.

The Software Factory Primitives Framework identifies four infrastructure primitives: Runtime, Orchestration, Triggers, and Coordination. For DevOps automation, your Triggers likely work well (PagerDuty alerts, CloudWatch events, CVE feeds). Your Runtime might be containers or VMs. Your Orchestration handles spinning agents up. But Coordination — how the agent gates its own progress through a multi-step remediation — is almost certainly missing.

Without explicit micro-step gates, an agent responding to an incident might identify the issue, attempt a fix, and apply it to production, all without verifying that the fix works in staging. That's the same step-skipping problem that plagues all ungated agent pipelines.

How Do I Build Event-Driven Agent Automation That's Actually Safe?

Use the Events pattern: webhook-style triggers bring agents online without human initiation. An alert fires → an agent spins up → the agent follows a gated micro-step sequence → the result is applied or escalated.
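As a rough illustration of the trigger side of the Events pattern, the sketch below parses a webhook body and decides whether to start an agent or escalate. The payload fields (`alert_type`, `system`), the `WORKFLOWS` table, and the workflow names are all hypothetical; real PagerDuty or CloudWatch payloads have their own schemas.

```python
import json

# Hypothetical mapping from alert type to a remediation workflow name.
WORKFLOWS = {
    "disk_full": "expand_volume",
    "cert_expiring": "rotate_certificate",
}


def route_alert(raw_body: str) -> dict:
    """Parse a webhook body and decide which workflow, if any, to start."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return {"action": "escalate", "reason": "unparseable payload"}

    if not isinstance(payload, dict):
        return {"action": "escalate", "reason": "unexpected payload shape"}

    alert_type = payload.get("alert_type")
    system = payload.get("system")

    # Anything the table does not recognise goes to a human, not to an agent.
    if not isinstance(alert_type, str) or alert_type not in WORKFLOWS or not system:
        return {"action": "escalate", "reason": "no known workflow"}

    return {
        "action": "start_agent",
        "workflow": WORKFLOWS[alert_type],
        "system": system,
    }
```

Calling `route_alert('{"alert_type": "disk_full", "system": "db-01"}')` would select the `expand_volume` workflow, while an unrecognised alert type falls through to escalation. Defaulting to escalation keeps the trigger layer from starting an agent on input it does not understand.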

The key is decomposing each remediation workflow into micro-steps with machine-checkable gates:

1. Receive alert — parse the alert payload, identify the affected system

2. Diagnose — query metrics, logs, or config to confirm the root cause

3. Gate: diagnosis confirmed — automated check that the diagnosis matches known patterns

4. Generate fix — produce the config change, script, or code patch

5. Test fix — apply to staging, run validation checks

6. Gate: tests pass — automated verification that staging is healthy

7. Apply fix — deploy to production

8. Gate: production healthy — post-deploy health checks pass

9. Close loop — update the incident, notify the team

At each gate, failure halts the agent and escalates to a human. The agent never self-certifies that a step is complete.
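The sequence above can be sketched as a small runner in which each step carries an optional gate, and the gate is a separate check rather than something the agent reports about itself. This is a minimal sketch, not the framework's own implementation: `Step`, `RunResult`, `run_pipeline`, and the `escalate` callback are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Context = Dict[str, object]


@dataclass
class Step:
    name: str
    action: Callable[[Context], None]                  # work performed by the agent
    gate: Optional[Callable[[Context], bool]] = None   # independent machine check


@dataclass
class RunResult:
    completed: List[str] = field(default_factory=list)
    halted_at: Optional[str] = None


def run_pipeline(
    steps: List[Step],
    ctx: Context,
    escalate: Callable[[str, Context], None],
) -> RunResult:
    """Run steps in order; a failed gate halts the run and escalates."""
    result = RunResult()
    for step in steps:
        step.action(ctx)
        # The gate decides whether the step counts as done, not the action.
        if step.gate is not None and not step.gate(ctx):
            result.halted_at = step.name
            escalate(step.name, ctx)
            return result
        result.completed.append(step.name)
    return result
```

With steps for "test fix" and "apply fix" in that order, a failing staging gate returns before the production step's action is ever called, which is the property the list above depends on.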

Why Do I Need VMs Instead of Containers for Agent Execution?

Containers share the host kernel, so they are not a hard security boundary: a kernel vulnerability or a misconfigured runtime can let a process escape to the host. For DevOps agents that modify infrastructure, execute scripts with elevated permissions, or access production systems, container escapes represent an unacceptable risk. Additionally, on shared Kubernetes clusters, containers create noisy-neighbour compute contention that can degrade agent performance during critical remediation tasks.

VM isolation provides hardware-level separation via hypervisors. Each agent runs in its own VM with explicitly scoped permissions — it can only access the systems it needs to, and a compromised agent cannot affect other workloads. This is the baseline for any agent that touches production infrastructure.

How Do I Encode Runbook Knowledge So Agents Get Better Over Time?

Harness Engineering is the iterative practice of encoding operational knowledge back into the repository. After each agent-driven remediation:

1. Review whether the agent followed every micro-step correctly

2. Identify where it drifted, skipped steps, or made suboptimal decisions

3. Encode the fix: update `agents.md` with remediation-specific rules, add context files with system architecture details, write tests that catch the failure mode

Over time, your infrastructure repos become living runbooks that agents can follow with increasing reliability. This is the compounding advantage — each incident makes the system smarter.
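One way to "write tests that catch the failure mode" is to check the recorded order of steps from a past run against ordering rules kept in the repo. The sketch below assumes the agent leaves a log of step names; the step names and the `REQUIRED_BEFORE` table are hypothetical.

```python
from typing import Dict, List

# Hypothetical rules: each key may only run after every step in its list.
REQUIRED_BEFORE: Dict[str, List[str]] = {
    "apply_fix": ["diagnose", "test_fix", "gate_tests_pass"],
    "close_loop": ["gate_production_healthy"],
}


def find_violations(
    step_log: List[str],
    rules: Dict[str, List[str]] = REQUIRED_BEFORE,
) -> List[str]:
    """Return one message per prerequisite that had not run yet."""
    violations: List[str] = []
    seen = set()
    for step in step_log:
        for prerequisite in rules.get(step, []):
            if prerequisite not in seen:
                violations.append(f"{step} ran without prior {prerequisite}")
        seen.add(step)
    return violations
```

Each time a review finds a skipped step, adding a rule to the table turns that incident into a permanent check that later runs must pass.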

Next step: Pick your most common alert-driven remediation workflow. Decompose it into micro-steps with machine-checkable gates. Implement VM-based agent execution for that workflow. Run it through one Harness Engineering cycle after the first live incident.

// FREQUENTLY ASKED QUESTIONS

How is this framework different from just writing automation scripts for incident response?

Traditional automation scripts are static and brittle — they break when conditions change. Agent-driven automation can reason about novel situations, but without a coordination layer and gates, agents skip steps and make unchecked changes. This framework gives you the infrastructure to let agents reason flexibly while still gating progress through machine-checkable steps. Harness Engineering ensures the system improves after each incident, unlike static scripts.

Can I use this framework with Terraform or Ansible agents?

Yes. Terraform or Ansible agents benefit from micro-step decomposition and gating. For example, a Terraform agent's workflow decomposes into: parse change request, generate plan, validate plan against policy, apply to staging, verify staging, apply to production, verify production. A gate after each step stops an unvalidated change from reaching the next environment. The coordination layer tracks state and escalates failures to humans.
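The "validate plan against policy" gate could look roughly like the sketch below, which inspects the JSON produced by `terraform show -json` on a saved plan file. It relies on the `resource_changes` list and each entry's `change.actions` field from that JSON output; the policy itself (no deletion of protected resource types) and the set of protected types are assumptions made for illustration.

```python
from typing import Iterable, List


def policy_violations(plan: dict, protected_types: Iterable[str]) -> List[str]:
    """Return addresses of protected resources the plan would delete.

    A replacement appears as both "delete" and "create" in the actions
    list, so it is flagged as well.
    """
    protected = set(protected_types)
    violations: List[str] = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        if "delete" in actions and change.get("type") in protected:
            violations.append(change.get("address", "<unknown>"))
    return violations


def plan_gate(plan: dict, protected_types: Iterable[str]) -> bool:
    """Gate passes only when the plan has no policy violations."""
    return not policy_violations(plan, protected_types)
```

A plan that replaces an `aws_db_instance` would fail the gate when that type is listed as protected, and the agent would escalate instead of applying to staging.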

How do I handle agents that need access to production secrets?

VM isolation is the baseline — each agent runs in its own VM with scoped permissions. Secrets should be injected via a secrets manager (Vault, AWS Secrets Manager) with time-limited, least-privilege access. The agent's VM should only receive the secrets needed for its current micro-step. Audit all agent permissions as part of the framework's security surface step, and monitor for anomalous secret access patterns.
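The idea of per-step, time-limited secret access can be sketched with an in-memory lease. This does not call Vault or AWS Secrets Manager; `Lease`, `STEP_SECRETS`, `issue_lease`, and `can_read` are hypothetical names, and a real setup would rely on the secrets manager's own TTL and policy features.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet

# Hypothetical mapping: which secrets each micro-step is allowed to read.
STEP_SECRETS: Dict[str, FrozenSet[str]] = {
    "test_fix": frozenset({"staging_deploy_token"}),
    "apply_fix": frozenset({"prod_deploy_token"}),
}


@dataclass(frozen=True)
class Lease:
    step: str
    secret_names: FrozenSet[str]
    expires_at: float


def issue_lease(
    step: str,
    ttl_seconds: float,
    now: Callable[[], float] = time.monotonic,
) -> Lease:
    """Grant only the secrets mapped to this step, for a limited time."""
    allowed = STEP_SECRETS.get(step, frozenset())  # unknown steps get nothing
    return Lease(step=step, secret_names=allowed, expires_at=now() + ttl_seconds)


def can_read(
    lease: Lease,
    secret_name: str,
    now: Callable[[], float] = time.monotonic,
) -> bool:
    """Allow access only to in-scope secrets while the lease is still valid."""
    return secret_name in lease.secret_names and now() < lease.expires_at
```

A lease issued for `test_fix` cannot read `prod_deploy_token`, and it stops working once its TTL has passed, so a stalled or compromised agent holds nothing useful for long.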