Burtenshaw AI Systems Engineering via Coding Agents
Apply coding agents to tackle hard AI systems engineering problems — from writing optimized CUDA kernels to fine-tuning LLMs to running a fully autonomous multi-agent AutoLab research pipeline.
// TL;DR
Burtenshaw AI Systems Engineering via Coding Agents is a framework for pushing coding agents beyond routine software tasks into genuine AI/ML systems engineering. It covers three progressive tiers: writing optimized CUDA kernels, zero-shot LLM fine-tuning, and running fully autonomous multi-agent AutoLab research pipelines. Use it when your problem requires going 'closer to the silicon' — optimizing inference kernels, automating model training, or setting up autonomous research loops with verifiable experiments. The framework relies on file-based Skills to convert zero-shot agent tasks into few-shot guided work, open primitives over opaque APIs, and distributed multi-agent teams for parallel experimentation.
// When should you use coding agents for AI systems engineering instead of routine software tasks?
Use this skill when you need to push coding agents beyond routine software tasks into genuine AI/ML systems engineering: optimizing inference kernels, automating model training, or setting up autonomous research loops. Trigger it whenever the problem requires going 'closer to the silicon' or involves verifiable ML experiments.
// What inputs do you need before applying coding agents to AI systems engineering?
- Target Engineering Problem (required)
Which of the three 'bosses' applies: (1) CUDA kernel writing, (2) zero-shot LLM fine-tuning, or (3) AutoLab multi-agent research.
- Hardware Profile (required)
The specific GPU hardware and CUDA version in play (e.g., H100, A100, consumer GPU). Needed because kernels are hardware-specific.
- Model and Dataset
The target model (e.g., Qwen 3 8B) and, for fine-tuning or research tasks, the dataset or training script to optimize.
- Skills Files
File-based context documents (skills) containing examples, benchmarking scripts, and references specific to the task domain.
- Experiment Hypothesis or Goal
For AutoLab: the research question or improvement direction (e.g., improve training efficiency measured in bits-per-byte).
// What principles guide effective AI systems engineering with coding agents?
Go Closer to the Silicon
To stay relevant and challenged as an AI engineer in the agent era, move into harder, lower-level problems: AI systems engineering and ML engineering. Routine coding tasks are commoditized; kernel writing and training optimization are not.
Zero-Shot to Few-Shot via Skills
Skills are file-based context — examples, scripts, and references — that an agent can open and close as needed. Loading a skill transforms a task from zero-shot (no examples) to few-shot (guided by examples), dramatically improving agent output quality. Skills should be maintained by the project owners who know them best, not kept as ad-hoc YOLO prompts.
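A minimal sketch of how loading a Skill turns a zero-shot prompt into a few-shot one. The directory layout (a `SKILL.md` guide plus an `examples/` folder) and the helper name `build_prompt` are assumptions made for illustration; the source does not specify a file format.

```python
from pathlib import Path
from typing import Optional
import tempfile


def build_prompt(task: str, skill_dir: Optional[Path] = None, max_examples: int = 3) -> str:
    """Return a prompt for `task`, prepending Skill context when a Skill is given."""
    if skill_dir is None:
        # Zero-shot: the agent sees only the task.
        return f"Task: {task}"

    parts = []
    guide = skill_dir / "SKILL.md"  # hypothetical entry-point file
    if guide.exists():
        parts.append(guide.read_text().strip())

    # Few-shot: append a bounded number of reference examples in a stable order.
    examples_dir = skill_dir / "examples"  # hypothetical folder name
    example_files = []
    if examples_dir.is_dir():
        example_files = sorted(p for p in examples_dir.iterdir() if p.is_file())
    for path in example_files[:max_examples]:
        parts.append(f"Example ({path.name}):\n{path.read_text().strip()}")

    parts.append(f"Task: {task}")
    return "\n\n".join(parts)


# Demo with a throwaway Skill directory.
with tempfile.TemporaryDirectory() as tmp:
    skill = Path(tmp)
    (skill / "SKILL.md").write_text("Benchmark every kernel before reporting a speedup.")
    (skill / "examples").mkdir()
    (skill / "examples" / "vector_add.cu").write_text("// reference kernel goes here")

    zero_shot = build_prompt("write a fused RMSNorm kernel")
    few_shot = build_prompt("write a fused RMSNorm kernel", skill)
```

Because the Skill lives in files, the project owners can version and review it like any other code, and the agent only pays the context cost when it opens the Skill.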
Primitives Over Abstracted APIs
Agents work best with open, inspectable primitives (data layers, file structures, CLI tools) rather than opaque abstracted APIs. Any layer an agent cannot get behind is a ceiling. The goal is to expose internals well, not to abstract them away by default.
Memory Is the Bottleneck, Not Compute
In deep learning efficiency, the three axes are compute (FLOPs), memory (tensor movement), and overhead (Python/dispatch environment). Counter-intuitively, memory bandwidth, not compute, is usually the bottleneck. Custom kernels address this by increasing arithmetic intensity: doing more math per byte read or written, so the compute units stay busy instead of waiting on memory ('keeping the GPUs warm').
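A worked example may make the memory-versus-compute point concrete. The sketch below computes arithmetic intensity (FLOPs per byte moved) for three cases and compares it to a roofline-style limit. The peak throughput and bandwidth figures are round illustrative numbers, not the specification of any real GPU, and counting ReLU as one FLOP per element is a simplification.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte read from or written to memory."""
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth)


# Illustrative hardware numbers (assumptions, not a real GPU spec).
PEAK_FLOPS = 100e12   # 100 TFLOP/s
BANDWIDTH = 2e12      # 2 TB/s
RIDGE = PEAK_FLOPS / BANDWIDTH  # intensity needed to become compute-bound

n = 1 << 20  # elements in each fp32 vector (4 bytes per element)

# Elementwise add: 1 FLOP per element; read a, read b, write out.
add_ai = arithmetic_intensity(flops=n, bytes_moved=3 * 4 * n)

# Unfused add then ReLU: the intermediate is written and read back.
unfused_ai = arithmetic_intensity(flops=2 * n, bytes_moved=(3 + 2) * 4 * n)

# Fused add+ReLU kernel: same math, the intermediate never leaves registers.
fused_ai = arithmetic_intensity(flops=2 * n, bytes_moved=3 * 4 * n)

# Square fp16 matmul (2 bytes per element): 2*m^3 FLOPs, ideal traffic 3*m^2 elements.
m = 4096
matmul_ai = arithmetic_intensity(flops=2 * m**3, bytes_moved=3 * 2 * m**2)

for name, ai in [("add", add_ai), ("add+relu unfused", unfused_ai),
                 ("add+relu fused", fused_ai), ("matmul 4096", matmul_ai)]:
    bound = "memory-bound" if ai < RIDGE else "compute-bound"
    share = attainable_flops(ai, PEAK_FLOPS, BANDWIDTH) / PEAK_FLOPS
    print(f"{name:18s} AI={ai:9.3f} FLOP/B  {bound}  ({share:.1%} of peak)")
```

Fusing the two elementwise operations performs the same math with less memory traffic, which is the kind of gain a custom kernel targets. Both elementwise cases still sit far below the ridge point, which is why such operations are memory-bound.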
Distribute the Research Team
A single iterating agent is less powerful than a distributed multi-agent team with defined roles (Researcher, Planner, Worker, Reporter). Specialization and parallelism unlock scale that a single-agent loop cannot achieve.
Verifiable Experiments as the Foundation
Autonomous agentic engineering works best when the experiment is verifiable — i.e., there is a measurable output (bits-per-byte, inference speedup, validation loss) that agents can objectively score and rank. If you have a verifiable experiment, the AutoLab pattern is straightforward to implement.
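The source names bits-per-byte as a scoring metric without defining it. As a sketch under the usual definition (total cross-entropy in bits divided by the number of bytes of text), the conversion from a per-token loss looks like this. The token counts and losses are made-up values for illustration.

```python
import math


def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert a mean per-token cross-entropy (in nats) to bits per byte of text."""
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    total_bits = mean_loss_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes


# The same 4000-byte text under two hypothetical tokenizers.
bpb_a = bits_per_byte(mean_loss_nats=2.0, n_tokens=1000, n_bytes=4000)
bpb_b = bits_per_byte(mean_loss_nats=2.4, n_tokens=800, n_bytes=4000)

print(f"model A: {bpb_a:.4f} bits/byte")
print(f"model B: {bpb_b:.4f} bits/byte")
```

Model B has the higher per-token loss but the lower bits-per-byte, because it covers the same text with fewer tokens. Normalizing by bytes keeps the score comparable when an experiment changes the tokenizer, which makes it a reasonable metric for agents to rank against.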
// How do you apply Burtenshaw's AI systems engineering framework step by step?
- 1
Select your Boss (problem tier)
Classify the engineering task into one of three progressive tiers: Boss 1 = write/optimize a CUDA kernel; Boss 2 = zero-shot LLM fine-tune on the Hub; Boss 3 = AutoLab multi-agent autonomous research. Each tier has increasing autonomy and complexity. Start at the lowest applicable tier.
- 2
Define the hardware and compatibility matrix
Identify the exact GPU generation, CUDA version, and software stack. For Boss 1, check the Kernels repo on Hugging Face Hub for an existing compatible kernel before writing from scratch — low-hanging speedup fruit often exists for specific hardware pairings. Populate the .toml configuration file with hardware compatibility metadata.
- 3
Load or create a Skill file for the task
A Skill is a structured file-based context document. For kernel writing, it should contain: benchmarking scripts, test scripts, and reference examples of working kernels. For fine-tuning, it should include CLI invocation patterns and Hub integration steps. Source skills from project-maintained repos (e.g., Hugging Face Skills repo) rather than writing ad-hoc prompts. This converts the agent task from zero-shot to few-shot.
- 4
Run the agent interactively (Boss 1) or zero-shot (Boss 2)
For Boss 1 (CUDA kernel): use an interactive/hybrid approach where you guide the agent using the Skill, then benchmark the output kernel using the benchmarking scripts in the Skill. Validate with the kernels library (.toml compatibility check). For Boss 2 (fine-tuning): issue a plain-language instruction such as 'fine-tune [model] on [dataset]' using HF CLI skills; the agent handles the Hub job submission and GPU provisioning.
- 5
Validate output with upskill (for Skill quality) or benchmark scripts (for kernels)
Use the upskill library to evaluate whether a Skill produces quality outputs across multiple models: it generates an eval, runs the Skill on cheaper or open models, and compares accuracy and token usage. For kernels, run the benchmarking script included in the Skill and measure the inference speedup as a percentage over the unoptimized baseline. The framework cites a 94% speedup on a hardware-specific pairing as a realistic target; actual gains depend on the GPU and the model.
- 6
Architect the AutoLab agent team (Boss 3 only)
Define four agent roles: (1) Researcher — scans papers (HF Papers CLI or arXiv) and formulates hypotheses; (2) Planner — maintains a queue of experiment jobs from those hypotheses, tracking current hyperparameters and past results; (3) Workers — pick up jobs from the queue, implement changes as training script patches (architecture or parameter modifications), and submit HF Jobs on the Hub; (4) Reporter — monitors all running jobs, maintains the Trackio dashboard, flags anomalies. Use a Git project as the shared state: main branch holds updated training scripts and a scores data structure; each experiment runs on its own branch.
- 7
Configure shared state and open data layer
Use Trackio as the dashboard and metrics store — its key property is that the underlying data layer is fully open (Parquet files), so agents can query it directly without going through the UI. Store training scripts in an HF bucket so workers do not have to upload/download scripts redundantly. Use HF Jobs labels so agents can tag, sort, and filter runs programmatically. Avoid any tool that puts an opaque API layer between the agent and the data.
- 8
Execute the AutoLab loop and monitor
Trigger the Planner to propose a first batch of single-change experiments. Before any job reaches the Workers, a review pass (the Reviewer/Reporter) rejects duplicates and stale ideas. Workers then run in parallel, potentially for hours. Monitor via the Trackio dashboard, and use its events and notification layer to raise an alert if agents go off-course. Generate a Gantt chart from the Parquet data layer to visualize agent timelines and experiment scores across the run. Then iterate: feed the scored results back to the Planner for the next batch.
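The review-then-parallel-run part of step 8 can be sketched in a few lines. This is a toy model under stated assumptions: `run_experiment` is a stand-in that returns a fake score instead of submitting a real job, the baseline hyperparameters are invented, and "stale" is interpreted here as "already run in an earlier batch".

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical baseline configuration held on the main branch.
BASELINE = {"lr": 3e-4, "activation": "gelu", "warmup_steps": 100}


def review(proposals, already_run):
    """Drop duplicates, stale ideas, and no-op changes before Workers see them."""
    seen = set(already_run)
    queue = []
    for param, value in proposals:
        change = (param, value)
        is_noop = BASELINE.get(param) == value
        if change in seen or is_noop:
            continue
        seen.add(change)
        queue.append(change)
    return queue


def run_experiment(change):
    """Worker stand-in: apply one change to the baseline and return a fake score."""
    param, value = change
    config = {**BASELINE, param: value}
    score = 1.00  # pretend validation metric, lower is better
    if config["activation"] == "swiglu":
        score -= 0.02
    if config["lr"] == 6e-4:
        score -= 0.01
    return change, round(score, 4)


proposals = [
    ("activation", "swiglu"),
    ("lr", 6e-4),
    ("activation", "swiglu"),  # duplicate within the batch
    ("lr", 3e-4),              # no-op: identical to the baseline
    ("warmup_steps", 500),     # stale: already run earlier
]
already_run = {("warmup_steps", 500)}

queue = review(proposals, already_run)

# Workers run independent single-change experiments in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(run_experiment, queue))

print(results)
```

Threads are a reasonable fit for this sketch because real Workers would spend most of their time waiting on remote jobs. Keeping each experiment to a single change is what lets a score difference be attributed to that change.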
// What are real-world examples of coding agents doing AI systems engineering?
A team is serving a large language model on H100 GPUs via a cloud provider. Inference costs are high and the model was not originally optimized for this GPU generation.
Apply Boss 1. Check the Kernels Hub repo for an existing kernel compatible with H100 and the model's architecture. Load the kernels Skill (with benchmarking scripts and reference examples). Instruct the coding agent to generate a custom CUDA kernel that increases arithmetic intensity for the model's dominant operation (e.g., attention). The kernel is configured via .toml for H100 compatibility and published back to the Hub as a reusable kernel repo. Benchmark with the Skill scripts; target low-hanging speedup from the hardware-specific optimization even if it is not state-of-the-art.
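The benchmarking step in this scenario could follow the shape below: check correctness first, then time both versions and report a speedup. This is a CPU-only sketch with plain Python functions standing in for the baseline and the optimized kernel. The formula in `speedup_percent` is one common definition of "percent speedup"; the source does not say which definition its 94% figure uses.

```python
import statistics
import time


def median_seconds(fn, *args, repeats=5, number=20):
    """Median per-call wall time over `repeats` samples of `number` calls each."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(number):
            fn(*args)
        samples.append((time.perf_counter() - start) / number)
    return statistics.median(samples)


def speedup_percent(baseline_s: float, candidate_s: float) -> float:
    """Percent speedup: 100 means the candidate is twice as fast as the baseline."""
    if candidate_s <= 0:
        raise ValueError("candidate time must be positive")
    return (baseline_s / candidate_s - 1.0) * 100.0


def benchmark(baseline, candidate, *args, repeats=5, number=20):
    """Verify that outputs match, then time both and return the percent speedup."""
    if baseline(*args) != candidate(*args):
        raise AssertionError("candidate output differs from baseline")
    base_s = median_seconds(baseline, *args, repeats=repeats, number=number)
    cand_s = median_seconds(candidate, *args, repeats=repeats, number=number)
    return speedup_percent(base_s, cand_s)


# Stand-ins for an unoptimized and an optimized implementation.
def baseline_sum_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


def candidate_sum_squares(n):
    return (n - 1) * n * (2 * n - 1) // 6  # closed form for sum of i^2, i < n


gain = benchmark(baseline_sum_squares, candidate_sum_squares, 2000, repeats=3)
print(f"speedup: {gain:.0f}%")
```

On a GPU, kernel launches are asynchronous, so a real harness would need to synchronize the device before reading the timer and run a few warm-up iterations first.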
A researcher wants to improve a small language model's chain-of-thought reasoning without manually writing training code.
Apply Boss 2. Identify the model and a chain-of-thought dataset. Load HF CLI skills. Issue a plain-language instruction: 'fine-tune [model] on [chain-of-thought dataset].' The agent handles script generation, Hub job submission, and GPU provisioning automatically. After training, use upskill to check whether the Skill that drove the run still performs well on cheaper open models.
An ML engineer wants to autonomously discover training improvements for a small GPT-style model, running experiments in parallel overnight.
Apply Boss 3 (AutoLab). Set up the four-agent team: Researcher scans recent papers for training improvement ideas and outputs hypotheses. Planner queues single-change experiments (e.g., change learning rate schedule, swap activation function). Workers implement each hypothesis as a patch to the training script and submit as HF Jobs. Reporter tracks all runs in Trackio, which stores metrics as open Parquet data. At the end of the run, query the data layer to rank experiments by the verifiable metric (e.g., bits-per-byte or validation loss) and promote the best patch to the main branch.
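The final ranking and promotion decision in this scenario reduces to a small amount of logic once run records are available. In the sketch below the records are an in-memory list; in practice they would be read from the open data layer. The branch names, field names, and scores are hypothetical.

```python
from typing import Optional

BASELINE_BPB = 0.995  # score of the script currently on main (illustrative)

# Hypothetical run records; field names are assumptions.
runs = [
    {"branch": "exp/swiglu", "status": "finished", "val_bpb": 0.981},
    {"branch": "exp/cosine-lr", "status": "finished", "val_bpb": 0.990},
    {"branch": "exp/deeper", "status": "failed", "val_bpb": None},
    {"branch": "exp/no-warmup", "status": "finished", "val_bpb": 1.012},
]


def rank_runs(runs):
    """Return finished runs sorted by the metric, best (lowest) first."""
    finished = [
        r for r in runs
        if r["status"] == "finished" and r["val_bpb"] is not None
    ]
    return sorted(finished, key=lambda r: r["val_bpb"])


def branch_to_promote(runs, baseline: float, min_gain: float = 0.0) -> Optional[str]:
    """Pick the best branch, but only if it beats the baseline by more than min_gain."""
    ranked = rank_runs(runs)
    if not ranked:
        return None
    best = ranked[0]
    if baseline - best["val_bpb"] > min_gain:
        return best["branch"]
    return None


leaderboard = rank_runs(runs)
winner = branch_to_promote(runs, BASELINE_BPB)
print("promote:", winner)
```

A nonzero `min_gain` is one way to avoid promoting a patch whose improvement is within run-to-run noise.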
// What mistakes should you avoid when using coding agents for AI systems engineering?
- Treating abstracted APIs as sufficient: any layer the agent cannot inspect or get behind is a hard ceiling on what it can accomplish. Prefer open primitives.
- Assuming compute is the bottleneck: in most deep learning workloads, memory bandwidth (tensor movement) is the actual constraint. Kernel optimization should target arithmetic intensity, not raw FLOPs.
- Using YOLO/ad-hoc skills instead of maintained project skills: unmanaged skill files degrade quickly. Skills should be owned and maintained by the same team that owns the project they support.
- Running a single iterating agent when the task is parallelizable: a single-agent loop misses the efficiency gains of distributing research roles. If experiments are verifiable and independent, use the multi-agent AutoLab pattern.
- Skipping the compatibility matrix for kernels: CUDA kernels are hardware-specific. Failing to define the .toml compatibility metadata means the kernel may be valid but silently unusable on the intended hardware.
- Not using upskill to validate Skill quality before scaling: deploying a Skill broadly without evaluating it across models and measuring accuracy/token trade-offs leads to unnecessary cost and degraded agent performance.
- Building a dashboard that the agent cannot query directly: dashboards backed by proprietary APIs hide data from agents. Use tools like Trackio whose underlying data layer (Parquet) is directly accessible to agents without UI mediation.
// What do the key terms in Burtenshaw's AI systems engineering framework mean?
- Boss
- One of three progressively more complex and autonomous engineering challenges: Boss 1 = CUDA kernel writing (hybrid/interactive), Boss 2 = zero-shot LLM fine-tuning, Boss 3 = AutoLab multi-agent research. The 'boss' framing reflects increasing difficulty and autonomy at each tier.
- Skill
- A file-based context document — containing examples, scripts, and references — that an agent can open and use on demand. Skills convert agent tasks from zero-shot to few-shot by providing structured, reusable guidance. Best practice: maintained by the team that owns the project the Skill supports.
- Zero-Shot to Few-Shot
- The transformation that happens when a Skill is loaded: the agent moves from attempting a task with no examples (zero-shot) to having curated examples and references available (few-shot), dramatically improving reliability and output quality.
- AutoLab
- A multi-agent autonomous research setup modeled on a distributed AI research team. Four roles — Researcher, Planner, Workers, Reporter — operate in parallel to propose, implement, run, and evaluate training experiments automatically.
- Arithmetic Intensity
- The ratio of computation performed to data movement in a GPU kernel. Custom kernels increase arithmetic intensity by doing more math per read/write cycle — 'keeping the GPUs warm' — which addresses the memory bandwidth bottleneck rather than compute.
- Kernels (library)
- A Hugging Face library and Hub repo format for distributing custom CUDA kernels. Each kernel repo includes a .toml file specifying hardware compatibility (GPU generation, CUDA version) and is maintained by kernel writers, analogous to model repos on the Hub.
- upskill
- An open-source library for evaluating and comparing Skill quality across different models. It generates an eval for a given Skill, runs it on multiple models, and reports accuracy and token usage — enabling engineers to swap to cheaper or open models without degrading Skill performance.
- Trackio
- An open-source metrics dashboard whose key property is a fully open data layer (Parquet files). Agents can query the data layer directly without UI mediation, enabling custom visualizations (e.g., Gantt charts) and programmatic access to all experiment metrics and events.
- HF Jobs
- Hugging Face Hub compute jobs that agents can submit programmatically. Workers in the AutoLab pattern submit training script patches as HF Jobs, which run on Hub-provisioned hardware and push results back to the shared data layer.
- Open Primitives
- Tools and data structures that are fully inspectable and controllable by agents without opaque abstraction layers — e.g., file systems, CLI tools, Parquet stores, Git repos. Burtenshaw's core principle: agents need open primitives; any abstraction an agent cannot get behind is a ceiling.
- Compatibility Matrix
- The mapping of a CUDA kernel's hardware and software requirements (GPU generation, CUDA version, library versions), specified in a .toml file. Essential for kernel distribution so users and agents know which hardware a kernel actually supports.
- Researcher (AutoLab role)
- The AutoLab agent responsible for scanning papers (via HF Papers CLI or arXiv) and formulating improvement ideas as hypotheses for the Planner to queue.
- Planner (AutoLab role)
- The AutoLab agent that receives hypotheses from the Researcher and maintains a structured queue of experiments, tracking current hyperparameters, past results, and which ideas are worth pursuing.
- Worker (AutoLab role)
- AutoLab agents that pick up queued hypotheses from the Planner, implement them as patches to the training script (architecture changes, parameter changes), and submit them as HF Jobs.
- Reporter (AutoLab role)
- The AutoLab agent that monitors all running jobs, maintains the Trackio dashboard, surfaces events and anomalies, and produces summary tables for other agents to consume.
// FREQUENTLY ASKED QUESTIONS
What is Burtenshaw AI Systems Engineering via Coding Agents?
It is a framework by Ben Burtenshaw of Hugging Face for applying coding agents to hard AI/ML systems engineering problems rather than routine software tasks. It defines three progressive tiers (called 'Bosses'): writing optimized CUDA kernels, zero-shot LLM fine-tuning on Hugging Face Hub, and running fully autonomous multi-agent AutoLab research pipelines. The framework emphasizes file-based Skills for few-shot guidance, open primitives over opaque APIs, and verifiable experiments as the foundation for autonomous agent work.
What is the AutoLab pattern in AI systems engineering?
AutoLab is a multi-agent autonomous research setup modeled on a distributed AI research team. It uses four specialized agent roles — Researcher, Planner, Workers, and Reporter — operating in parallel to propose hypotheses, queue experiments, implement training script patches, submit compute jobs, and track results. The shared state is a Git repo with open data layers like Parquet files, enabling agents to query metrics directly. AutoLab works best when experiments produce verifiable, measurable outputs like validation loss or bits-per-byte.
How do I use coding agents to write CUDA kernels?
Start by checking the Hugging Face Kernels Hub repo for an existing kernel compatible with your GPU and model. Load a kernel-writing Skill file containing benchmarking scripts, test scripts, and reference kernel examples. Use an interactive, hybrid approach: guide the agent with the Skill, then benchmark the output kernel. Configure the kernel's .toml file with your hardware compatibility metadata (GPU generation, CUDA version). Realistic targets include 94% inference speedup on hardware-specific pairings. Publish the kernel back to the Hub as a reusable repo.
How do I fine-tune an LLM using a coding agent with zero-shot instructions?
Load HF CLI skills and issue a plain-language instruction such as 'fine-tune Qwen 3 8B on this chain-of-thought dataset.' The coding agent handles training script generation, Hub job submission, and GPU provisioning automatically. After training completes, use the upskill library to check whether the Skill that drove the run still performs well on cheaper or open models. This is the Boss 2 tier of the framework: it requires less interactive guidance than kernel writing but still benefits from maintained project Skills.
How does Burtenshaw's approach compare to just using a coding agent for regular software development?
Regular coding agent usage targets commoditized software tasks — writing CRUD apps, fixing bugs, generating boilerplate. Burtenshaw's framework pushes agents into AI systems engineering, where the problems are harder and less commoditized: optimizing CUDA kernels for specific GPU hardware, automating ML training pipelines, and running autonomous research loops. The key differences are the use of file-based Skills for few-shot guidance, insistence on open primitives the agent can inspect, focus on memory bandwidth over compute, and multi-agent distribution for parallelizable research.
When should I use the AutoLab multi-agent pattern instead of a single coding agent?
Use AutoLab when your task involves multiple independent, verifiable experiments that can run in parallel — for example, testing different hyperparameters, architecture changes, or training strategies overnight. A single iterating agent is sufficient for interactive kernel writing or one-off fine-tuning jobs. But when the goal is autonomous discovery across many experiments with measurable outcomes (validation loss, bits-per-byte), the distributed four-role team (Researcher, Planner, Workers, Reporter) unlocks scale and efficiency a single agent loop cannot match.
What results can I expect from applying coding agents to CUDA kernel optimization?
A realistic baseline target is around 94% inference speedup on hardware-specific GPU pairings, especially when the original model was not optimized for the target GPU generation. Results depend heavily on the hardware compatibility matrix — kernels are GPU-specific, so defining the .toml compatibility metadata correctly is essential. The biggest gains typically come from increasing arithmetic intensity (doing more math per memory read/write), which addresses the memory bandwidth bottleneck rather than raw compute. Low-hanging speedups often already exist in the Kernels Hub repo.
What are Skills files in coding agent AI engineering?
Skills are file-based context documents containing examples, scripts, and references that a coding agent can open and use on demand. They convert agent tasks from zero-shot (no examples) to few-shot (guided by examples), dramatically improving output quality and reliability. Best practice is to maintain Skills in the same repo as the project they support, owned by the team that knows the domain best. Ad-hoc 'YOLO prompts' degrade quickly — structured, versioned Skill files are the alternative.
What is arithmetic intensity and why does it matter for CUDA kernel agents?
Arithmetic intensity is the ratio of computation performed to data movement in a GPU kernel. In deep learning, memory bandwidth — not compute — is usually the actual bottleneck. Custom CUDA kernels solve this by increasing arithmetic intensity: doing more math per read/write cycle to 'keep the GPUs warm.' When coding agents write kernels, they should target operations that reduce memory transfers relative to FLOPs, not simply maximize raw compute throughput.
What tools does this framework use for experiment tracking and agent data access?
The framework uses Trackio as the metrics dashboard and data store. Trackio's key property is a fully open data layer built on Parquet files, which agents can query directly without going through a UI or proprietary API. This enables agents to build custom visualizations like Gantt charts, filter runs programmatically, and detect anomalies. HF Jobs labels let agents tag and sort compute runs. The principle is that any dashboard the agent cannot query directly is a ceiling on autonomous operation.
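As an illustration of what direct data access enables, the sketch below filters runs by label and renders a text Gantt chart. The run records are an in-memory list standing in for rows an agent would read from the open data layer. Every field name and value is hypothetical, and times are minutes since the start of the session.

```python
# Hypothetical run records; in practice these would come from the data layer.
runs = [
    {"name": "exp-a", "agent": "worker-1", "labels": {"activation"}, "start": 0, "end": 20},
    {"name": "exp-b", "agent": "worker-2", "labels": {"lr"}, "start": 0, "end": 30},
    {"name": "exp-c", "agent": "worker-1", "labels": {"activation", "lr"}, "start": 20, "end": 40},
]


def with_label(runs, label):
    """Programmatic filter: keep only runs tagged with `label`."""
    return [r for r in runs if label in r["labels"]]


def gantt(runs, width=40):
    """Render one bar per run, scaled so the whole session spans `width` columns."""
    if not runs:
        return ""
    t0 = min(r["start"] for r in runs)
    t1 = max(r["end"] for r in runs)
    span = max(t1 - t0, 1)  # avoid dividing by zero for instantaneous sessions
    lines = []
    for r in sorted(runs, key=lambda r: r["start"]):
        left = (r["start"] - t0) * width // span
        length = max(1, (r["end"] - r["start"]) * width // span)
        lines.append(f"{r['name']:<12}|{' ' * left}{'#' * length}")
    return "\n".join(lines)


chart = gantt(runs)
print(chart)
print([r["name"] for r in with_label(runs, "lr")])
```

None of this needs a UI: the chart and the filter are a few lines over plain records.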