How ML Infra Engineers Use Coding Agents for Kernel Optimization
For ML infrastructure engineers at cloud-serving companies · Based on Burtenshaw AI Systems Engineering via Coding Agents
// TL;DR
ML infrastructure engineers can use Burtenshaw's framework to push coding agents into CUDA kernel optimization, the Boss 1 tier. Instead of writing and benchmarking GPU kernels by hand, load a Skill file with reference kernels, benchmarking scripts, and `.toml` hardware compatibility metadata. The agent then generates hardware-specific kernels that raise arithmetic intensity, which addresses the memory bandwidth bottleneck. Target an inference speedup of around 94% on specific GPU pairings, publish optimized kernels to the Hugging Face Hub as reusable repos, and use upskill to validate Skill quality before scaling across your fleet.
Why should ML infrastructure engineers care about coding agents for kernel work?
Inference cost is one of the largest line items for any team serving LLMs at scale. Custom CUDA kernels tuned to your specific GPU hardware can deliver dramatic speedups — but writing them manually is time-consuming and requires deep GPU architecture knowledge. Burtenshaw's framework lets you apply coding agents to this problem using structured Skill files that contain benchmarking scripts, reference kernel examples, and hardware compatibility metadata.
The key insight is that memory bandwidth, not compute, is the actual bottleneck in most inference workloads. Coding agents guided by Skills can generate kernels that increase arithmetic intensity: performing more arithmetic per byte read from or written to memory, so the GPU's compute units stay busy instead of waiting on data. This directly reduces serving cost.
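To make the bandwidth argument concrete, here is a minimal sketch of the roofline arithmetic for a matrix-vector product, the kind of operation that dominates single-request decoding. The `Gpu` class, the helper names, and the peak figures are illustrative assumptions. None of them come from Burtenshaw's framework or from a real datasheet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Peak figures for one GPU; the values used below are placeholders."""
    peak_flops: float      # floating-point operations per second
    peak_bandwidth: float  # bytes per second to and from device memory


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte read from or written to memory."""
    return flops / bytes_moved


def ridge_point(gpu: Gpu) -> float:
    """Intensity below which a kernel is limited by memory bandwidth."""
    return gpu.peak_flops / gpu.peak_bandwidth


def attainable_flops(gpu: Gpu, intensity: float) -> float:
    """Roofline bound: the lower of the compute and memory ceilings."""
    return min(gpu.peak_flops, gpu.peak_bandwidth * intensity)


def matvec_intensity(rows: int, cols: int, bytes_per_elem: int = 2) -> float:
    """Intensity of y = W @ x with 16-bit elements by default."""
    flops = 2 * rows * cols  # one multiply and one add per weight
    bytes_moved = bytes_per_elem * (rows * cols + cols + rows)  # W, x, y
    return arithmetic_intensity(flops, bytes_moved)


# Hypothetical round numbers, not a real GPU's specification.
gpu = Gpu(peak_flops=1000e12, peak_bandwidth=2e12)
ai = matvec_intensity(4096, 4096)
memory_bound = ai < ridge_point(gpu)
fraction_of_peak = attainable_flops(gpu, ai) / gpu.peak_flops
```

With these placeholder figures the matrix-vector product performs roughly one FLOP per byte, far below the ridge point of 500 FLOPs per byte, so the roofline caps it at about 0.2% of peak compute. Fusing operations or reusing loaded data raises the intensity and moves the kernel toward the compute ceiling.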
How do you set up a coding agent to write CUDA kernels?
Start by defining your hardware profile. Identify the exact GPU model and generation (H100, A100, etc.), CUDA version, and software stack. Check the Hugging Face Kernels Hub for existing compatible kernels; easy speedups often already exist for specific hardware pairings.
Next, load or create a Skill file. A kernel-writing Skill should include:
- Reference examples of working CUDA kernels for similar operations
- Benchmarking scripts that measure inference speedup
- Test scripts for correctness validation
- Hardware-specific patterns and notes
This converts the agent's task from zero-shot (guessing) to few-shot (guided by examples). The quality difference is substantial.
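The correctness test scripts in such a Skill can be sketched as a comparison against a trusted reference over randomized inputs. The example below is a CPU-only illustration in plain Python. The function names, input sizes, and tolerance are assumptions. A real test would call the compiled kernel and compare tensors instead of lists.

```python
import math
import random
from typing import Callable, List, Sequence

Kernel = Callable[[Sequence[float]], List[float]]


def reference_softmax(x: Sequence[float]) -> List[float]:
    """Numerically stable softmax used as the trusted baseline."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def sloppy_softmax(x: Sequence[float]) -> List[float]:
    """Deliberately wrong candidate: a stray epsilon skews the normalization."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps) + 1e-3
    return [e / total for e in exps]


def max_abs_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest elementwise difference; a length mismatch is an error."""
    if len(a) != len(b):
        raise ValueError("output shapes differ")
    return max(abs(p - q) for p, q in zip(a, b))


def check_kernel(candidate: Kernel, reference: Kernel,
                 sizes: Sequence[int] = (1, 7, 128, 1000),
                 trials: int = 5, atol: float = 1e-6, seed: int = 0) -> bool:
    """Return True only if the candidate matches the reference on every input."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for n in sizes:
        for _ in range(trials):
            x = [rng.uniform(-10.0, 10.0) for _ in range(n)]
            if max_abs_error(candidate(x), reference(x)) > atol:
                return False
    return True
```

Including odd sizes such as 1 and 7 matters because agent-written kernels often handle only shapes that divide evenly into thread blocks.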
Use an interactive, hybrid approach: guide the agent using the Skill, review its kernel output, then benchmark using the Skill's scripts. Configure the kernel's `.toml` file with compatibility metadata so it's reusable on intended hardware.
What's the biggest mistake infra engineers make with agent-written kernels?
Skipping the compatibility matrix. CUDA kernels are hardware-specific: a kernel built and tuned for H100 (sm_90) may fail to load or silently underperform on A100 (sm_80). Every kernel must have a `.toml` file specifying GPU generation, CUDA version, and library requirements. Without this, you'll deploy kernels that compile but don't deliver the expected speedup on your actual hardware.
The second mistake is assuming compute is the bottleneck. If your agent is optimizing for raw FLOPs without addressing tensor movement patterns, the resulting kernel will stay memory-bound: its compute units sit idle waiting on data, and the extra FLOPs never turn into faster inference.
How do you validate and share optimized kernels?
Run the benchmarking scripts included in your Skill file. Measure actual inference speedup against the baseline kernel. A 94% speedup on a hardware-specific pairing is a realistic target for well-guided agents.
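A speedup figure is only meaningful once its definition is fixed. The sketch below assumes "speedup" means throughput gain relative to the baseline, which the source does not spell out. The helper names are illustrative and are not taken from the Skill's benchmarking scripts.

```python
import statistics
import time
from typing import Callable


def median_seconds(fn: Callable[[], object], warmup: int = 3,
                   repeats: int = 20) -> float:
    """Median wall-clock time of fn, after warm-up runs that are discarded."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # a real GPU benchmark must synchronize the device before timing stops
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup_percent(baseline_s: float, optimized_s: float) -> float:
    """Throughput gain in percent: 100 * (baseline / optimized - 1)."""
    if baseline_s <= 0 or optimized_s <= 0:
        raise ValueError("timings must be positive")
    return 100.0 * (baseline_s / optimized_s - 1.0)
```

Under this definition a 94% speedup means the optimized kernel handles 1.94 times the baseline throughput, so it finishes the same work in roughly 52% of the baseline time. The median is used instead of the mean so that occasional slow runs do not distort the result.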
Use upskill to evaluate whether your Skill produces quality kernels across multiple models and hardware profiles. If accuracy drops on different setups, the Skill needs more diverse examples.
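The kind of per-setup comparison described here can be sketched as a simple aggregation. The code below does not use upskill's interface, which is not shown in this article. The `Run` record, the function names, and the 0.8 threshold are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple

Setup = Tuple[str, str]  # (model, hardware profile)


class Run(NamedTuple):
    model: str      # agent model that used the Skill
    hardware: str   # hardware profile label, e.g. "sm_90"
    passed: bool    # kernel was correct and beat the baseline


def pass_rates(runs: Iterable[Run]) -> Dict[Setup, float]:
    """Fraction of successful runs for each (model, hardware) pair."""
    totals = defaultdict(lambda: [0, 0])  # setup -> [passed, attempted]
    for run in runs:
        counts = totals[(run.model, run.hardware)]
        counts[0] += int(run.passed)
        counts[1] += 1
    return {setup: p / n for setup, (p, n) in totals.items()}


def weak_setups(runs: Iterable[Run], threshold: float = 0.8) -> List[Setup]:
    """Setups whose pass rate falls below the threshold, in sorted order."""
    return sorted(s for s, rate in pass_rates(runs).items() if rate < threshold)
```

Any setup returned by `weak_setups` is a candidate for additional reference examples in the Skill.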
Publish the optimized kernel back to the Hugging Face Hub as a kernel repo with its `.toml` compatibility file. This makes it reusable by your team and by the broader community.
Next step: Check the Hugging Face Kernels Hub for existing kernels compatible with your GPU, load the kernel-writing Skill, and run your first agent-guided kernel optimization on your highest-cost inference workload.
// FREQUENTLY ASKED QUESTIONS
How much inference speedup can I expect from agent-written CUDA kernels?
A realistic target is around 94% inference speedup over the baseline kernel on hardware-specific GPU pairings, especially when the model wasn't originally optimized for your target GPU generation. Results depend on correct hardware profiling, the quality of the Skill file's reference examples, and whether the kernel targets arithmetic intensity rather than raw FLOPs.
Do I need to know CUDA to use this framework for kernel optimization?
You need enough CUDA knowledge to evaluate agent output and configure the `.toml` compatibility file, but the Skill file provides reference examples and benchmarking scripts that reduce the expertise bar. The agent handles the low-level kernel generation; your role is interactive guidance, hardware profiling, and validation.
Can I use this for kernels beyond attention operations?
Yes. The framework applies to any CUDA kernel: attention, matrix multiplication, normalization, custom activation functions. The Skill file should contain reference examples specific to your target operation. The principle of increasing arithmetic intensity applies wherever a kernel is limited by memory bandwidth, which is the case for most inference workloads.