Burtenshaw AI Systems Engineering via Coding Agents

Last updated: 21 May 2026

Apply coding agents to tackle hard AI systems engineering problems — from writing optimized CUDA kernels to fine-tuning LLMs to running a fully autonomous multi-agent AutoLab research pipeline.

// TL;DR

Burtenshaw AI Systems Engineering via Coding Agents is a framework for pushing coding agents beyond routine software tasks into hard AI/ML systems engineering. It defines three progressive tiers: writing optimized CUDA kernels, zero-shot LLM fine-tuning, and running fully autonomous multi-agent research pipelines (AutoLab). Use it when your problem requires going closer to the silicon — optimizing inference kernels, automating model training, or setting up autonomous research loops with verifiable experiments. The framework relies on file-based Skills to convert agents from zero-shot to few-shot, open primitives over opaque APIs, and distributed multi-agent teams for parallelizable research.


// When should you use Burtenshaw AI Systems Engineering via Coding Agents?

Use this skill when you need to push coding agents beyond routine software tasks into genuine AI/ML systems engineering: optimizing inference kernels, automating model training, or setting up autonomous research loops. Trigger it whenever the problem requires going 'closer to the silicon' or involves verifiable ML experiments.

// What inputs do you need to apply this framework?

Target Engineering Problem (required): Which of the three 'bosses' applies: (1) CUDA kernel writing, (2) zero-shot LLM fine-tuning, or (3) AutoLab multi-agent research.
Hardware Profile (required): The specific GPU hardware (e.g., H100, A100, consumer GPU) and the CUDA version in play. Needed because kernels are hardware-specific.
Model and Dataset: The target model (e.g., Qwen 3 8B) and, for fine-tuning or research tasks, the dataset or training script to optimize.
Skills Files: File-based context documents (skills) containing examples, benchmarking scripts, and references specific to the task domain.
Experiment Hypothesis or Goal: For AutoLab, the research question or improvement direction (e.g., improve training efficiency measured in bits-per-byte).

// What core principles guide this approach to AI systems engineering with agents?

Go Closer to the Silicon

To stay relevant and challenged as an AI engineer in the agent era, move into harder, lower-level problems: AI systems engineering and ML engineering. Routine coding tasks are commoditized; kernel writing and training optimization are not.

Zero-Shot to Few-Shot via Skills

Skills are file-based context — examples, scripts, and references — that an agent can open and close as needed. Loading a skill transforms a task from zero-shot (no examples) to few-shot (guided by examples), dramatically improving agent output quality. Skills should be maintained by the project owners who know them best, not kept as ad-hoc YOLO prompts.
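The "open and close as needed" idea can be sketched as a small index that always exposes the file listing but reads a file's contents only on request. This is an illustrative sketch, not the format of any particular agent or Skills repo: the class `SkillIndex`, the file names (`SKILL.md`, `examples/rmsnorm.cu`, `scripts/benchmark.py`), and the directory layout are all assumptions made for the example.

```python
from pathlib import Path
import tempfile


class SkillIndex:
    """Lists the files in a skill directory and reads them only on request."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def listing(self) -> list[str]:
        # Cheap summary the agent can always see: relative paths, no contents.
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )

    def open(self, relative_path: str) -> str:
        # Full contents enter the context only when the agent asks for them.
        return (self.root / relative_path).read_text(encoding="utf-8")


def build_demo_skill(root: Path) -> None:
    # Hypothetical layout: one instruction file, one example, one script.
    (root / "examples").mkdir()
    (root / "scripts").mkdir()
    (root / "SKILL.md").write_text(
        "Write kernels; benchmark before reporting.\n", encoding="utf-8"
    )
    (root / "examples" / "rmsnorm.cu").write_text(
        "// reference kernel goes here\n", encoding="utf-8"
    )
    (root / "scripts" / "benchmark.py").write_text(
        "print('benchmark placeholder')\n", encoding="utf-8"
    )


with tempfile.TemporaryDirectory() as tmp:
    build_demo_skill(Path(tmp))
    skill = SkillIndex(Path(tmp))
    demo_listing = skill.listing()            # paths only
    demo_instructions = skill.open("SKILL.md")  # contents on demand
```

Keeping the listing separate from the contents is the design point: the examples that turn a task few-shot are available, but they cost context only when the agent decides to read them.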

Primitives Over Abstracted APIs

Agents work best with open, inspectable primitives — data layers, file structures, CLI tools — rather than opaque abstracted APIs. Any layer an agent cannot get behind is a ceiling. The goal is to expose well, not always to abstract.

Memory Is the Bottleneck, Not Compute

In deep learning efficiency, the three axes are compute (FLOPs), memory (tensor movement), and overhead (Python and dispatch costs). Counter-intuitively, memory bandwidth, not compute, is usually the bottleneck. Custom kernels address this by increasing arithmetic intensity: doing more math per byte read or written, which 'keeps the GPUs warm'.
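A small worked calculation may make the arithmetic-intensity argument concrete. The sketch below compares an unfused and a fused add-then-ReLU on n elements, then applies the usual roofline test (intensity below peak FLOP/s divided by peak bandwidth means memory-bound). The assumptions are loud ones: ReLU is counted as one FLOP per element, memory traffic is idealized (each tensor read or written once), and the peak figures are round illustrative numbers, not the specification of any real GPU.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def unfused_add_relu(n: int, dtype_bytes: int = 2) -> float:
    # Kernel 1 (add):  read x and b, write t -> 3n elements moved, n FLOPs.
    # Kernel 2 (relu): read t, write y       -> 2n elements moved, n FLOPs.
    return arithmetic_intensity(2 * n, 5 * n * dtype_bytes)


def fused_add_relu(n: int, dtype_bytes: int = 2) -> float:
    # One kernel: read x and b, write y; the intermediate never leaves registers.
    return arithmetic_intensity(2 * n, 3 * n * dtype_bytes)


def matmul(m: int, n: int, k: int, dtype_bytes: int = 2) -> float:
    # 2*m*n*k FLOPs; ideal traffic reads A and B once and writes C once.
    return arithmetic_intensity(
        2 * m * n * k, dtype_bytes * (m * k + k * n + m * n)
    )


def is_memory_bound(intensity: float, peak_flops: float, peak_bytes_per_s: float) -> bool:
    # Roofline rule: below the machine balance, bandwidth sets the speed limit.
    return intensity < peak_flops / peak_bytes_per_s


# Illustrative round numbers only (assumption, not a real GPU spec).
PEAK_FLOPS = 1e15
PEAK_BYTES_PER_S = 3e12

unfused = unfused_add_relu(1 << 20)   # 0.2 FLOPs per byte
fused = fused_add_relu(1 << 20)       # about 0.33 FLOPs per byte
big_matmul = matmul(4096, 4096, 4096)
```

Under these assumptions fusion raises intensity by two thirds without adding a single FLOP, yet the fused op is still far below the machine balance, while the large matmul sits above it. That is the sense in which elementwise-heavy workloads are limited by tensor movement rather than compute.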

Distribute the Research Team

A single iterating agent is less powerful than a distributed multi-agent team with defined roles (Researcher, Planner, Worker, Reporter). Specialization and parallelism unlock scale that a single-agent loop cannot achieve.

Verifiable Experiments as the Foundation

Autonomous agentic engineering works best when the experiment is verifiable — i.e., there is a measurable output (bits-per-byte, inference speedup, validation loss) that agents can objectively score and rank. If you have a verifiable experiment, the AutoLab pattern is straightforward to implement.
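As an illustration of what "verifiable" means in practice, the sketch below computes bits-per-byte from a summed negative log-likelihood and ranks runs by it. It assumes the loss is summed in nats over the evaluation text and that lower is better; the function names and the run labels are hypothetical, and the scores are placeholder numbers rather than results.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Dividing by ln(2) converts nats to bits; dividing by the byte count
    normalizes for text length, independent of the tokenizer.
    """
    return total_nll_nats / (math.log(2) * total_bytes)


def rank_runs(scores: dict[str, float]) -> list[tuple[str, float]]:
    # Lower bits-per-byte is better, so sort ascending.
    return sorted(scores.items(), key=lambda item: item[1])


# A model that spends exactly one bit per byte on 1000 bytes of text.
one_bit = bits_per_byte(total_nll_nats=math.log(2) * 1000, total_bytes=1000)

# Placeholder scores for three hypothetical runs.
ranking = rank_runs({"run-a": 1.05, "run-b": 0.99, "run-c": 1.02})
```

Because the metric is a single number with a fixed direction, any agent can score and order runs without human judgment, which is what makes the loop automatable.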

// How do you apply Burtenshaw AI Systems Engineering step by step?

1. Select your Boss (problem tier)
Classify the engineering task into one of three progressive tiers: Boss 1 = write/optimize a CUDA kernel; Boss 2 = zero-shot LLM fine-tune on the Hub; Boss 3 = AutoLab multi-agent autonomous research. Each tier has increasing autonomy and complexity. Start at the lowest applicable tier.
2. Define the hardware and compatibility matrix
Identify the exact GPU generation, CUDA version, and software stack. For Boss 1, check the Kernels repo on Hugging Face Hub for an existing compatible kernel before writing from scratch — low-hanging speedup fruit often exists for specific hardware pairings. Populate the .toml configuration file with hardware compatibility metadata.
3. Load or create a Skill file for the task
A Skill is a structured file-based context document. For kernel writing, it should contain: benchmarking scripts, test scripts, and reference examples of working kernels. For fine-tuning, it should include CLI invocation patterns and Hub integration steps. Source skills from project-maintained repos (e.g., Hugging Face Skills repo) rather than writing ad-hoc prompts. This converts the agent task from zero-shot to few-shot.
4. Run the agent interactively (Boss 1) or zero-shot (Boss 2)
For Boss 1 (CUDA kernel): use an interactive/hybrid approach where you guide the agent using the Skill, then benchmark the output kernel using the benchmarking scripts in the Skill. Validate with the kernels library (.toml compatibility check). For Boss 2 (fine-tuning): issue a plain-language instruction such as 'fine-tune [model] on [dataset]' using HF CLI skills; the agent handles the Hub job submission and GPU provisioning.
5. Validate output with upskill (for Skill quality) or benchmark scripts (for kernels)
Use the upskill library to evaluate whether a Skill produces quality outputs across multiple models — it generates an eval, runs the Skill on cheaper/open models, and compares accuracy and token usage. For kernels, run the benchmarking script included in the Skill and measure inference speedup percentage. A 94% speedup on a hardware-specific pairing is a realistic baseline target.
6. Architect the AutoLab agent team (Boss 3 only)
Define four agent roles: (1) Researcher — scans papers (HF Papers CLI or arXiv) and formulates hypotheses; (2) Planner — maintains a queue of experiment jobs from those hypotheses, tracking current hyperparameters and past results; (3) Workers — pick up jobs from the queue, implement changes as training script patches (architecture or parameter modifications), and submit HF Jobs on the Hub; (4) Reporter — monitors all running jobs, maintains the Trackio dashboard, flags anomalies. Use a Git project as the shared state: main branch holds updated training scripts and a scores data structure; each experiment runs on its own branch.
7. Configure shared state and open data layer
Use Trackio as the dashboard and metrics store — its key property is that the underlying data layer is fully open (Parquet files), so agents can query it directly without going through the UI. Store training scripts in an HF bucket so workers do not have to upload/download scripts redundantly. Use HF Jobs labels so agents can tag, sort, and filter runs programmatically. Avoid any tool that puts an opaque API layer between the agent and the data.
8. Execute the AutoLab loop and monitor
Trigger the Planner to propose a first batch of single-change experiments. The Reporter, in its reviewing capacity, rejects duplicate or stale ideas before they are passed to the Workers. Workers run in parallel (potentially for hours). Monitor via the Trackio dashboard; use its events and notification layer to alert you if agents go off-course. Generate a Gantt chart from the Parquet data layer to visualize agent timelines and experiment scores across the run. Then iterate with the next batch.
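The loop in the final steps can be sketched with the standard library alone: a batch of single-change experiments is reviewed for duplicates, then the survivors run in parallel. Everything here is a stand-in. `run_worker` does not patch a script or submit a job, the `exp/<name>` branch naming is an assumption, and `PLACEHOLDER_SCORES` holds made-up numbers so the example is deterministic.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    name: str    # short slug, reused as the Git branch suffix
    change: str  # the single change this experiment makes


# Made-up numbers for the demo, not real results.
PLACEHOLDER_SCORES = {"baseline": 1.05, "cosine-lr": 1.02, "swiglu": 0.99}


def review(batch: list[Experiment], already_tried: set[str]) -> list[Experiment]:
    """Drop ideas whose change was already tried or repeats within the batch."""
    accepted = []
    for exp in batch:
        key = exp.change.strip().lower()
        if key not in already_tried:
            already_tried.add(key)
            accepted.append(exp)
    return accepted


def run_worker(exp: Experiment) -> tuple[str, float]:
    # Stand-in for: patch the training script, submit a job, wait for the metric.
    return f"exp/{exp.name}", PLACEHOLDER_SCORES[exp.name]


def run_batch(batch: list[Experiment], max_workers: int = 4) -> dict[str, float]:
    # Experiments are independent, so Workers can run side by side.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_worker, batch))


proposed = [
    Experiment("baseline", "no change"),
    Experiment("cosine-lr", "switch to cosine learning rate schedule"),
    Experiment("swiglu", "swap activation to SwiGLU"),
    Experiment("cosine-lr-2", "Switch to cosine learning rate schedule"),
]
tried: set[str] = set()
accepted = review(proposed, tried)   # the repeated cosine idea is dropped
scores = run_batch(accepted)         # one score per experiment branch
```

The review step runs before any compute is spent, which is where rejecting a duplicate is cheapest; the `tried` set persists across batches so stale ideas stay rejected on later iterations.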

// What are real-world examples of coding agents doing AI systems engineering?

A team is serving a large language model on H100 GPUs via a cloud provider. Inference costs are high and the model was not originally optimized for this GPU generation.

Apply Boss 1. Check the Kernels Hub repo for an existing kernel compatible with H100 and the model's architecture. Load the kernels Skill (with benchmarking scripts and reference examples). Instruct the coding agent to generate a custom CUDA kernel that increases arithmetic intensity for the model's dominant operation (e.g., attention). Configure the kernel via .toml for H100 compatibility and publish it back to the Hub as a reusable kernel repo. Benchmark with the Skill scripts; aim for the easy speedup that hardware-specific optimization offers, even if the result is not state-of-the-art.

A researcher wants to improve a small language model's chain-of-thought reasoning without manually writing training code.

Apply Boss 2. Identify the model and a chain-of-thought dataset. Load HF CLI skills. Issue a plain-language instruction: 'fine-tune [model] on [chain-of-thought dataset].' The agent handles script generation, Hub job submission, and GPU provisioning automatically. After training, use upskill to evaluate whether the Skill used for the task generalizes across cheaper open models.
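The accuracy-versus-cost comparison described for Skill evaluation could feed a simple selection rule like the one sketched here. This is not the upskill API: the `EvalResult` record, its fields, the model names, the prices, and the 0.02 tolerance are all hypothetical, chosen only to show the trade-off being made.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    model: str                  # hypothetical model label
    accuracy: float             # fraction of eval cases passed with the Skill loaded
    tokens: int                 # total tokens used across the eval
    price_per_million: float    # assumed price per million tokens

    @property
    def cost(self) -> float:
        return self.tokens / 1_000_000 * self.price_per_million


def cheapest_acceptable(
    results: list[EvalResult], reference_model: str, max_accuracy_drop: float = 0.02
) -> EvalResult:
    """Pick the lowest-cost model whose accuracy stays near the reference."""
    reference = next(r for r in results if r.model == reference_model)
    floor = reference.accuracy - max_accuracy_drop
    candidates = [r for r in results if r.accuracy >= floor]
    return min(candidates, key=lambda r: r.cost)


# Placeholder eval results, not measurements.
results = [
    EvalResult("large-model", accuracy=0.90, tokens=100_000, price_per_million=10.0),
    EvalResult("medium-open-model", accuracy=0.89, tokens=120_000, price_per_million=1.0),
    EvalResult("small-open-model", accuracy=0.70, tokens=80_000, price_per_million=0.2),
]
choice = cheapest_acceptable(results, reference_model="large-model")
```

Note that the medium model wins despite using more tokens than the reference: what matters for the swap is total cost at acceptable accuracy, not token count alone.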

An ML engineer wants to autonomously discover training improvements for a small GPT-style model, running experiments in parallel overnight.

Apply Boss 3 (AutoLab). Set up the four-agent team: Researcher scans recent papers for training improvement ideas and outputs hypotheses. Planner queues single-change experiments (e.g., change learning rate schedule, swap activation function). Workers implement each hypothesis as a patch to the training script and submit as HF Jobs. Reporter tracks all runs in Trackio, which stores metrics as open Parquet data. At the end of the run, query the data layer to rank experiments by the verifiable metric (e.g., bits-per-byte or validation loss) and promote the best patch to the main branch.
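The final promotion step might look like the sketch below, which treats the shared state as a plain dictionary: the score currently on main plus one score per experiment branch. The dictionary shape, the key names, and the `min_improvement` guard are assumptions for illustration, and the scores are placeholders; a lower score is taken to be better, as with bits-per-byte or validation loss.

```python
def promote_best(state: dict, min_improvement: float = 0.0) -> dict:
    """Return the shared state after promoting the best branch, if it beats main.

    Lower scores are better. The input state is left unmodified.
    """
    best_branch, best_score = min(
        state["experiments"].items(), key=lambda item: item[1]
    )
    if best_score < state["main"]["score"] - min_improvement:
        return {**state, "main": {"score": best_score, "source": best_branch}}
    return state


# Placeholder scores, not real results.
state = {
    "main": {"score": 1.05, "source": "baseline"},
    "experiments": {"exp/cosine-lr": 1.02, "exp/swiglu": 0.99},
}

promoted = promote_best(state)                          # main now points at exp/swiglu
unchanged = promote_best(state, min_improvement=0.1)    # gain too small, main is kept
```

A non-zero `min_improvement` is one way to avoid promoting changes whose gain is within run-to-run noise; how large it should be depends on the variance of the metric, which this sketch does not estimate.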

// What mistakes should you avoid when using coding agents for AI systems engineering?

Treating abstracted APIs as sufficient: any layer the agent cannot inspect or get behind is a hard ceiling on what it can accomplish. Prefer open primitives.
Assuming compute is the bottleneck: in most deep learning workloads, memory bandwidth (tensor movement) is the actual constraint. Kernel optimization should target arithmetic intensity, not raw FLOPs.
Using YOLO/ad-hoc skills instead of maintained project skills: unmanaged skill files degrade quickly. Skills should be owned and maintained by the same team that owns the project they support.
Running a single iterating agent when the task is parallelizable: a single-agent loop misses the efficiency gains of distributing research roles. If experiments are verifiable and independent, use the multi-agent AutoLab pattern.
Skipping the compatibility matrix for kernels: CUDA kernels are hardware-specific. Failing to define the .toml compatibility metadata means the kernel may be valid but silently unusable on the intended hardware.
Not using upskill to validate Skill quality before scaling: deploying a Skill broadly without evaluating it across models and measuring accuracy/token trade-offs leads to unnecessary cost and degraded agent performance.
Building a dashboard that the agent cannot query directly: dashboards backed by proprietary APIs hide data from agents. Use tools like Trackio whose underlying data layer (Parquet) is directly accessible to agents without UI mediation.

// What key terms and concepts does this framework define?

Boss: One of three progressively more complex and autonomous engineering challenges: Boss 1 = CUDA kernel writing (hybrid/interactive), Boss 2 = zero-shot LLM fine-tuning, Boss 3 = AutoLab multi-agent research. The 'boss' framing reflects increasing difficulty and autonomy at each tier.
Skill: A file-based context document — containing examples, scripts, and references — that an agent can open and use on demand. Skills convert agent tasks from zero-shot to few-shot by providing structured, reusable guidance. Best practice: maintained by the team that owns the project the Skill supports.
Zero-Shot to Few-Shot: The transformation that happens when a Skill is loaded: the agent moves from attempting a task with no examples (zero-shot) to having curated examples and references available (few-shot), dramatically improving reliability and output quality.
AutoLab: A multi-agent autonomous research setup modeled on a distributed AI research team. Four roles — Researcher, Planner, Workers, Reporter — operate in parallel to propose, implement, run, and evaluate training experiments automatically.
Arithmetic Intensity: The ratio of computation performed to data movement in a GPU kernel. Custom kernels increase arithmetic intensity by doing more math per read/write cycle — 'keeping the GPUs warm' — which addresses the memory bandwidth bottleneck rather than compute.
Kernels (library): A Hugging Face library and Hub repo format for distributing custom CUDA kernels. Each kernel repo includes a .toml file specifying hardware compatibility (GPU generation, CUDA version) and is maintained by kernel writers, analogous to model repos on the Hub.
upskill: An open-source library for evaluating and comparing Skill quality across different models. It generates an eval for a given Skill, runs it on multiple models, and reports accuracy and token usage — enabling engineers to swap to cheaper or open models without degrading Skill performance.
Trackio: An open-source metrics dashboard whose key property is a fully open data layer (Parquet files). Agents can query the data layer directly without UI mediation, enabling custom visualizations (e.g., Gantt charts) and programmatic access to all experiment metrics and events.
HF Jobs: Hugging Face Hub compute jobs that agents can submit programmatically. Workers in the AutoLab pattern submit training script patches as HF Jobs, which run on Hub-provisioned hardware and push results back to the shared data layer.
Open Primitives: Tools and data structures that are fully inspectable and controllable by agents without opaque abstraction layers — e.g., file systems, CLI tools, Parquet stores, Git repos. Burtenshaw's core principle: agents need open primitives; any abstraction an agent cannot get behind is a ceiling.
Compatibility Matrix: The mapping of a CUDA kernel's hardware and software requirements (GPU generation, CUDA version, library versions), specified in a .toml file. Essential for kernel distribution so users and agents know which hardware a kernel actually supports.
Researcher (AutoLab role): The AutoLab agent responsible for scanning papers (via HF Papers CLI or arXiv) and formulating improvement ideas as hypotheses for the Planner to queue.
Planner (AutoLab role): The AutoLab agent that receives hypotheses from the Researcher and maintains a structured queue of experiments, tracking current hyperparameters, past results, and which ideas are worth pursuing.
Worker (AutoLab role): AutoLab agents that pick up queued hypotheses from the Planner, implement them as patches to the training script (architecture changes, parameter changes), and submit them as HF Jobs.
Reporter (AutoLab role): The AutoLab agent that monitors all running jobs, maintains the Trackio dashboard, surfaces events and anomalies, and produces summary tables for other agents to consume.

// FREQUENTLY ASKED QUESTIONS

What is Burtenshaw AI Systems Engineering via Coding Agents?

It is a framework by Ben Burtenshaw of Hugging Face for applying coding agents to hard AI/ML systems engineering problems rather than routine software tasks. It defines three progressive tiers (called Bosses): writing optimized CUDA kernels, zero-shot LLM fine-tuning on Hugging Face Hub, and running a fully autonomous multi-agent research pipeline called AutoLab. The framework emphasizes file-based Skills for few-shot guidance, open primitives over opaque APIs, and verifiable experiments as the foundation for autonomous engineering.

What is AutoLab in the context of AI agent research?

AutoLab is a multi-agent autonomous research setup where four specialized agent roles — Researcher, Planner, Workers, and Reporter — operate in parallel to propose, implement, run, and evaluate ML training experiments automatically. It uses a Git repo as shared state, Trackio for open metrics storage via Parquet files, and HF Jobs for compute. AutoLab works best when experiments have verifiable metrics like bits-per-byte or validation loss that agents can objectively score and rank.

How do I use coding agents to write CUDA kernels?

Start by checking the Hugging Face Kernels Hub repo for an existing kernel compatible with your GPU hardware. Load a kernel-writing Skill file containing benchmarking scripts, test scripts, and reference kernel examples. Instruct the coding agent interactively to generate a custom CUDA kernel that increases arithmetic intensity for your model's dominant operation. Configure the kernel via a .toml compatibility file specifying your GPU generation and CUDA version, then benchmark the output using the Skill's included scripts.
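The compatibility check itself is simple to reason about once the metadata is loaded. The sketch below models it with a plain dataclass rather than the real .toml schema, which is not reproduced here: `KernelCompat`, its fields, and the example values are hypothetical. The one detail worth copying is that versions are compared as integer tuples, since comparing "12.10" and "12.4" as strings gives the wrong order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelCompat:
    gpu_generations: frozenset   # e.g. {"hopper"}; H100 is a Hopper-generation GPU
    min_cuda: tuple              # (major, minor)


def parse_version(text: str) -> tuple[int, int]:
    # "12.4" -> (12, 4); tuples compare numerically, strings would not.
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def is_compatible(compat: KernelCompat, gpu_generation: str, cuda_version: str) -> bool:
    return (
        gpu_generation in compat.gpu_generations
        and parse_version(cuda_version) >= compat.min_cuda
    )


# Hypothetical metadata for a kernel built for Hopper with CUDA 12.4 or newer.
compat = KernelCompat(gpu_generations=frozenset({"hopper"}), min_cuda=(12, 4))

on_h100 = is_compatible(compat, "hopper", "12.10")   # newer minor version is accepted
on_a100 = is_compatible(compat, "ampere", "12.10")   # A100 is Ampere, so rejected
old_cuda = is_compatible(compat, "hopper", "11.8")   # CUDA too old, so rejected
```

Running a check like this before benchmarking turns a silent mismatch into an explicit failure the agent can act on.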

How do I fine-tune an LLM using a coding agent with zero setup?

Load HF CLI skills into your coding agent, then issue a plain-language instruction like 'fine-tune Qwen 3 8B on this chain-of-thought dataset.' The agent handles training script generation, Hugging Face Hub job submission, and GPU provisioning automatically. After training completes, use the upskill library to evaluate whether the Skill used for the task generalizes across cheaper or open-source models. This is the Boss 2 tier in Burtenshaw's framework, requiring less manual guidance than kernel writing.

How does Burtenshaw's approach compare to using coding agents for regular software engineering?

Regular software engineering tasks like writing web apps or CRUD APIs are increasingly commoditized by coding agents. Burtenshaw's framework deliberately pushes agents into harder, lower-level problems — CUDA kernel optimization, ML training automation, and autonomous research — where the work is not yet commoditized. The key differentiator is going 'closer to the silicon,' requiring hardware-specific knowledge, verifiable ML metrics, and open data primitives that standard software agent workflows do not address.

When should I use the AutoLab multi-agent pattern instead of a single coding agent?

Use the AutoLab multi-agent pattern when your experiments are verifiable and independent — meaning each experiment produces a measurable metric and does not depend on another experiment's results. If you can define a clear scoring metric like validation loss or inference speedup, distributing work across Researcher, Planner, Worker, and Reporter agents unlocks parallelism and scale that a single iterating agent cannot achieve. For sequential, tightly coupled tasks, a single agent may still be appropriate.

What are Skills files in the Burtenshaw framework?

Skills are file-based context documents containing examples, benchmarking scripts, and references that a coding agent can open and use on demand. Loading a Skill transforms an agent task from zero-shot (no examples) to few-shot (guided by curated examples), dramatically improving output quality. Skills should be maintained by the team that owns the project they support — not stored as ad-hoc prompts — so they stay current and reliable over time.

What results can I expect from applying this framework to CUDA kernel optimization?

A realistic baseline target is around 94% inference speedup on a hardware-specific GPU pairing, even without achieving state-of-the-art kernel performance. The framework focuses on low-hanging fruit: increasing arithmetic intensity to address the memory bandwidth bottleneck rather than raw compute FLOPs. The generated kernel is published as a reusable Hugging Face Hub kernel repo with .toml compatibility metadata, making it distributable and verifiable by other engineers.
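How a speedup percentage is computed matters when comparing numbers. The source does not define it, so the sketch below assumes the common throughput-gain definition, (baseline time / optimized time - 1) x 100, under which 94% means the optimized kernel runs 1.94 times as fast. The helper names are hypothetical, and the timing helper measures ordinary Python callables; for real GPU kernels the device must be synchronized before each clock read, because kernel launches are asynchronous.

```python
import statistics
import time
from typing import Callable


def median_seconds(fn: Callable[[], object], repeats: int = 5) -> float:
    """Median wall-clock time of fn over several runs (robust to outliers)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup_percent(baseline_s: float, optimized_s: float) -> float:
    # Assumed definition: relative throughput gain over the baseline.
    return (baseline_s / optimized_s - 1.0) * 100.0


# Under this definition, 1.94 s -> 1.00 s is a 94% speedup,
# and halving the runtime is a 100% speedup.
example = speedup_percent(baseline_s=1.94, optimized_s=1.0)
halved = speedup_percent(baseline_s=2.0, optimized_s=1.0)

# Timing an arbitrary stand-in workload.
measured = median_seconds(lambda: sum(range(10_000)), repeats=3)
```

If a report instead means "94% less time", the optimized run would take 6% of the baseline, a very different claim, so it is worth stating the definition alongside the number.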

Why does the framework say memory is the bottleneck, not compute?

In most deep learning workloads, moving tensors between GPU memory levels (bandwidth) takes more time than the actual floating-point computation. Custom CUDA kernels solve this by increasing arithmetic intensity — performing more math operations per memory read/write cycle — which 'keeps the GPUs warm.' This counter-intuitive insight means kernel optimization efforts should focus on reducing data movement overhead rather than maximizing raw FLOPs throughput.

What tools does the Burtenshaw framework use for experiment tracking in AutoLab?

The framework uses Trackio, an open-source metrics dashboard whose defining property is a fully open data layer built on Parquet files. Agents can query experiment metrics directly from Parquet without going through a UI or proprietary API. This enables programmatic access for generating Gantt charts, ranking experiments by score, and detecting anomalies. The framework explicitly warns against dashboards with opaque APIs that hide data from agents.
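Once run records are in hand, both the timeline view and a basic anomaly check need very little code. The sketch below starts from rows that are assumed to have been read from the Parquet files already; the column names (`agent`, `start`, `end`, `score`) are hypothetical and not Trackio's actual schema, and the values are placeholders.

```python
import math


def text_gantt(runs: list[dict], width: int = 40) -> list[str]:
    """Render one bar per run, scaled to `width` characters across the whole span."""
    start = min(r["start"] for r in runs)
    end = max(r["end"] for r in runs)
    scale = width / max(end - start, 1)
    lines = []
    for r in sorted(runs, key=lambda r: r["start"]):
        left = round((r["start"] - start) * scale)
        length = max(1, round((r["end"] - r["start"]) * scale))
        lines.append(f'{r["agent"]:<10}|{" " * left}{"#" * length}')
    return lines


def flag_anomalies(runs: list[dict]) -> list[str]:
    # A NaN score usually means the run diverged or never reported a metric.
    return [r["agent"] for r in runs if math.isnan(r["score"])]


# Placeholder rows; times are seconds since the start of the session.
runs = [
    {"agent": "worker-1", "start": 0, "end": 20, "score": 1.02},
    {"agent": "worker-2", "start": 10, "end": 40, "score": float("nan")},
]
chart = text_gantt(runs)
suspects = flag_anomalies(runs)
```

Neither function touches a dashboard or an API, which is the point of an open data layer: an agent that can read the rows can build whatever view or check it needs.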
