How ML Infra Engineers Use Coding Agents for GPU Optimization
For ML infrastructure engineers at AI startups · Based on Burtenshaw AI Systems Engineering via Coding Agents
// TL;DR
ML infrastructure engineers can use the Burtenshaw framework to apply coding agents to CUDA kernel optimization and inference cost reduction. Start with Boss 1: check the Hugging Face Kernels Hub for existing kernels matching your GPU, load a kernel-writing Skill with benchmarking scripts, and let the agent generate hardware-specific kernels that increase arithmetic intensity. The framework's emphasis on the .toml compatibility matrix and open primitives helps confirm that a kernel is valid for your exact hardware pairing. Expect roughly 94% inference speedup on a matched pairing, achieved by addressing the memory bandwidth bottleneck rather than raw compute.
Why should ML infra engineers use coding agents for kernel optimization?
Inference cost is one of the largest line items for AI startups serving large language models. Writing custom CUDA kernels is the highest-leverage optimization available, but it traditionally requires deep GPU programming expertise. Burtenshaw's framework lets ML infrastructure engineers direct coding agents to generate hardware-specific kernels using structured Skill files — converting the task from zero-shot (the agent guessing) to few-shot (the agent guided by reference kernels, benchmarking scripts, and compatibility metadata).
The core insight is that memory bandwidth, not compute, is usually the bottleneck. Custom kernels increase arithmetic intensity: they perform more arithmetic per byte read from or written to memory, which keeps the GPU's compute units busy instead of waiting on data. This addresses the actual constraint rather than adding FLOPs the workload cannot use.
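To make the bandwidth argument concrete, here is a minimal sketch of the standard roofline estimate. The `Gpu` class, its peak numbers, and the elementwise workload are illustrative assumptions rather than figures from the framework or from any datasheet; substitute the values for your own hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Peak numbers for a hypothetical GPU (assumed, not from a datasheet)."""
    peak_flops: float      # FLOP/s
    peak_bandwidth: float  # bytes/s

    @property
    def ridge_point(self) -> float:
        """Arithmetic intensity (FLOP/byte) above which compute is the limit."""
        return self.peak_flops / self.peak_bandwidth


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def attainable_flops(gpu: Gpu, intensity: float) -> float:
    """Roofline model: the lower of peak compute and bandwidth * intensity."""
    return min(gpu.peak_flops, gpu.peak_bandwidth * intensity)


# Illustrative round numbers only.
gpu = Gpu(peak_flops=1.0e15, peak_bandwidth=2.0e12)

# Elementwise op on n fp16 values: 2 FLOPs per element,
# one 2-byte read and one 2-byte write per element.
n = 1_000_000
unfused = arithmetic_intensity(flops=2 * n, bytes_moved=4 * n)

# Fusing a second 2-FLOP elementwise op into the same kernel
# doubles the math while the memory traffic stays the same.
fused = arithmetic_intensity(flops=4 * n, bytes_moved=4 * n)

for label, ai in (("unfused", unfused), ("fused", fused)):
    bound = "memory" if ai < gpu.ridge_point else "compute"
    print(f"{label}: {ai:.2f} FLOP/byte, {bound}-bound, "
          f"attainable {attainable_flops(gpu, ai):.2e} FLOP/s")
```

Both variants sit far below the ridge point in this example, so doubling the intensity doubles the attainable throughput without any change to the hardware. That is the effect a fused custom kernel is meant to capture.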
How do you set up a kernel optimization workflow with coding agents?
Start by defining your hardware profile: the exact GPU model (for example H100, A100, or a consumer GPU), CUDA version, and software stack. Then check the Hugging Face Kernels Hub for an existing kernel compatible with your hardware and model architecture; the easiest speedup is often one that someone has already published.
Next, load or create a Skill file. A good kernel Skill contains three things: reference examples of working kernels for similar hardware, benchmarking scripts that measure inference speedup, and test scripts that validate correctness. Store the Skill in a version-controlled repo maintained by your infra team.
Configure the .toml compatibility file with your hardware metadata. This is not optional — CUDA kernels are hardware-specific, and skipping this step means the kernel may be valid code but silently unusable on your intended GPU.
Run the agent interactively (Boss 1 is hybrid, not fully autonomous). Guide it to target the model's dominant operation — typically attention — and increase arithmetic intensity for that operation. Benchmark the output kernel against the baseline using the Skill's scripts.
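A benchmarking script of the kind a Skill might bundle can be sketched as follows. This is a CPU-only stand-in under stated assumptions: `baseline_kernel` and `candidate_kernel` are hypothetical pure-Python placeholders for the reference implementation and the agent-generated kernel, and the harness is not the framework's own script. When timing real CUDA kernels, remember that launches are asynchronous, so the device must be synchronized before each timer read.

```python
import math
import statistics
import time
from typing import Callable, Sequence

Kernel = Callable[[Sequence[float]], list]


def baseline_kernel(xs: Sequence[float]) -> list:
    """Placeholder baseline: two passes with an intermediate list."""
    scaled = [2.0 * x for x in xs]
    return [s + 1.0 for s in scaled]


def candidate_kernel(xs: Sequence[float]) -> list:
    """Placeholder candidate: the same math fused into one pass."""
    return [2.0 * x + 1.0 for x in xs]


def outputs_match(a: Sequence[float], b: Sequence[float], rel_tol: float = 1e-6) -> bool:
    """Element-wise comparison with a tolerance, as for floating-point kernels."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-9) for x, y in zip(a, b)
    )


def median_seconds(fn: Kernel, xs: Sequence[float], warmup: int = 3, repeats: int = 15) -> float:
    """Median wall-clock time over several runs, after warm-up iterations."""
    for _ in range(warmup):
        fn(xs)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(xs)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(baseline: Kernel, candidate: Kernel, xs: Sequence[float]) -> dict:
    """Check correctness first, then time both kernels on the same input."""
    if not outputs_match(baseline(xs), candidate(xs)):
        raise ValueError("candidate output differs from baseline; not benchmarking")
    t_base = median_seconds(baseline, xs)
    t_cand = median_seconds(candidate, xs)
    return {"baseline_s": t_base, "candidate_s": t_cand, "speedup_x": t_base / t_cand}


data = [float(i) for i in range(50_000)]
report = compare(baseline_kernel, candidate_kernel, data)
print(f"speedup: {report['speedup_x']:.2f}x")
```

Running the correctness check before any timing keeps a fast but wrong kernel from ever producing a speedup number, and the median resists the occasional slow run better than the mean does.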
What results should ML infra engineers expect?
A realistic expectation is around 94% inference speedup on a hardware-specific pairing, even without achieving state-of-the-art kernel performance. The generated kernel can be published as a reusable Hugging Face Hub kernel repo with .toml compatibility metadata, making it distributable across your team and infrastructure.
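The text does not say how the 94% figure is measured. As a rough sketch, assuming it means throughput rises to 1.94 times the unoptimized figure, the arithmetic below converts that into latency and fleet-size terms. The function names and the 16-GPU fleet are hypothetical, and the fleet estimate further assumes that inference is the only load and that it scales linearly.

```python
import math


def speedup_factor(speedup_percent: float) -> float:
    """Assumed reading: a 94% speedup means 1.94x the original throughput."""
    return 1.0 + speedup_percent / 100.0


def latency_reduction_percent(speedup_percent: float) -> float:
    """Percentage drop in time per request for a given throughput speedup."""
    return (1.0 - 1.0 / speedup_factor(speedup_percent)) * 100.0


def gpus_needed(current_gpus: int, speedup_percent: float) -> int:
    """GPUs required to serve the same load after the speedup (linear scaling assumed)."""
    return math.ceil(current_gpus / speedup_factor(speedup_percent))


print(f"{latency_reduction_percent(94.0):.1f}% less time per request")  # 48.5%
print(f"{gpus_needed(16, 94.0)} GPUs instead of 16")                    # 9
```

Under this reading, a 94% speedup roughly halves time per request rather than eliminating 94% of it, which is worth stating explicitly when reporting results to people who budget for GPU capacity.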
The framework explicitly warns against two pitfalls relevant to infra engineers: assuming compute is the bottleneck (optimize for memory bandwidth instead), and skipping the compatibility matrix (kernels are hardware-specific — validate the .toml before deploying).
What's the next step?
Check the Hugging Face Kernels Hub for existing kernels matching your GPU and model architecture. If none exist, create a kernel-writing Skill file with your benchmarking scripts and reference examples. Start with the interactive Boss 1 workflow and measure speedup on your actual inference workload before scaling.
// FREQUENTLY ASKED QUESTIONS
Can a coding agent write production-ready CUDA kernels?
A coding agent guided by a well-maintained Skill file can generate kernels that achieve significant inference speedups on specific hardware. However, you must validate the output using benchmarking scripts and the .toml compatibility matrix. The framework recommends an interactive (hybrid) approach for kernel writing — not fully autonomous — because hardware-specific optimization requires human oversight to catch subtle correctness issues.
Do I need to check the Kernels Hub before writing a new kernel?
Yes. The Hugging Face Kernels Hub often contains existing kernels optimized for specific GPU generations and model architectures. Checking the Hub first is the lowest-effort path to inference speedup. If a compatible kernel exists, you can skip the generation step entirely and focus on benchmarking and integration.
What GPU hardware does this framework support?
The framework supports any GPU hardware as long as you define the compatibility matrix in the .toml file — including H100, A100, and consumer GPUs. However, kernels are hardware-specific, so a kernel optimized for one GPU generation may not deliver the same speedup on another. Always benchmark on your target hardware.