Frequently Asked Questions About Burtenshaw AI Systems Engineering via Coding Agents

22 answers covering everything from basics to advanced usage.

// Basics

What is the difference between Boss 1, Boss 2, and Boss 3 in Burtenshaw's framework?

Boss 1 is CUDA kernel writing — interactive and hybrid, where you guide the agent and benchmark outputs. Boss 2 is zero-shot LLM fine-tuning — the agent handles script generation, job submission, and GPU provisioning from a plain-language instruction. Boss 3 is AutoLab — a fully autonomous multi-agent research pipeline with four specialized roles running parallel experiments. Each tier represents increasing autonomy and complexity. Start at the lowest applicable tier for your problem.

What does 'go closer to the silicon' mean for AI engineers using coding agents?

It means moving from commoditized high-level software tasks into lower-level, harder problems like GPU kernel optimization, memory bandwidth management, and ML training infrastructure. Routine coding — CRUD apps, API wrappers — is increasingly handled by generic agents. AI systems engineering (custom CUDA kernels, training pipelines, inference optimization) remains non-commoditized and requires deeper hardware understanding. This is where AI engineers stay relevant and where coding agents provide the most differentiated value.

How do Skills convert zero-shot to few-shot for coding agents?

A Skill is a file containing examples, scripts, and references relevant to a specific task. When an agent opens a Skill file, it gains curated context (benchmarking scripts, working kernel examples, CLI patterns) that it lacks in a zero-shot attempt. This turns the attempt from guessing without examples into working from structured guidance. Agents with Skills produce more reliable, higher-quality outputs and waste fewer tokens on dead-end approaches.

Why does Burtenshaw say memory is the bottleneck and not compute?

In most deep learning workloads, GPUs spend more time waiting for data to move between memory tiers (HBM, L2 cache, registers) than performing actual math. Modern GPUs have enormous compute capacity (FLOPs) but comparatively limited memory bandwidth. Custom CUDA kernels address this by increasing arithmetic intensity — performing more computation per byte moved. Optimizing for raw FLOPs while ignoring tensor movement patterns leaves the GPU 'cold' and underutilized.
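
As a rough illustration of arithmetic intensity, the sketch below compares an unfused and a fused version of relu(a * b + c) on float32 tensors. The function names, the choice to count ReLU as one FLOP per element, and the peak hardware numbers are all assumptions made for the example; the peak values are round placeholders, not the specification of any particular GPU.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def is_memory_bound(intensity: float, peak_flops: float, peak_bandwidth: float) -> bool:
    """True when intensity falls below the ridge point (peak FLOP/s / peak bytes/s)."""
    return intensity < peak_flops / peak_bandwidth


n = 1_000_000   # elements per tensor
BYTES = 4       # bytes per float32 element

# Unfused relu(a * b + c): three kernels, each round-tripping through memory.
#   mul:  read a, b;  write t1   -> 3 tensors
#   add:  read t1, c; write t2   -> 3 tensors
#   relu: read t2;    write out  -> 2 tensors
unfused = arithmetic_intensity(flops=3 * n, bytes_moved=8 * n * BYTES)

# Fused kernel: read a, b, c once and write out once -> 4 tensors.
fused = arithmetic_intensity(flops=3 * n, bytes_moved=4 * n * BYTES)

# Illustrative round numbers only, not the spec of any particular GPU.
PEAK_FLOPS = 1e15       # FLOP/s
PEAK_BANDWIDTH = 3e12   # bytes/s

print(f"unfused: {unfused:.5f} FLOP/byte")
print(f"fused:   {fused:.5f} FLOP/byte")
print("fused still memory-bound:", is_memory_bound(fused, PEAK_FLOPS, PEAK_BANDWIDTH))
```

Fusing halves the bytes moved for the same math, so intensity doubles, yet an elementwise kernel like this stays far below the ridge point. That gap is why memory traffic, not FLOPs, is usually the first thing to optimize.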

What is the open primitives principle and why does it matter for agents?

Open primitives are tools and data structures that agents can fully inspect and control: file systems, CLI tools, Parquet files, Git repos. The principle states that any abstraction layer an agent cannot get behind is a hard ceiling on what it can accomplish. For example, a dashboard with a proprietary API prevents agents from querying raw metrics. Using Trackio with Parquet files instead lets agents query, filter, and visualize data directly. This principle applies throughout the framework — from kernel repos to experiment tracking.

Why should Skills be maintained by the project team and not prompt engineers?

Skills contain domain-specific knowledge — benchmarking scripts, reference implementations, hardware compatibility details — that only the project team fully understands. Prompt engineers disconnected from the project create ad-hoc Skills that degrade as the project evolves. When the project team maintains Skills alongside their code, the Skills stay current, accurate, and aligned with actual workflows. This is analogous to documentation ownership: docs maintained by developers who write the code are more reliable than docs written by outsiders.

// How To

How do I set up the AutoLab four-agent team?

Define four roles: (1) Researcher scans papers via HF Papers CLI or arXiv and outputs hypotheses. (2) Planner maintains a queue of single-change experiments from those hypotheses, tracking hyperparameters and past results. (3) Workers pick up queued jobs, implement changes as training script patches, and submit them as HF Jobs. (4) Reporter monitors all running jobs via Trackio, flags anomalies, and maintains summary tables. Use a Git repo as shared state with each experiment on its own branch.
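
The four roles can be sketched as plain Python to make the handoffs concrete. This is a minimal offline sketch, not Burtenshaw's implementation: every class and function name is hypothetical, `researcher` returns hard-coded hypotheses instead of scanning papers, and `fake_train` returns placeholder scores instead of submitting a real job.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Experiment:
    hypothesis: str
    change: dict                      # exactly one changed setting
    branch: str = ""                  # Git branch holding this experiment
    metric: Optional[float] = None    # e.g. validation loss, filled in by a Worker


@dataclass
class Planner:
    queue: deque = field(default_factory=deque)

    def enqueue(self, hypothesis: str, change: dict) -> None:
        if len(change) != 1:
            raise ValueError("each experiment must change exactly one thing")
        self.queue.append(Experiment(hypothesis, change))


def researcher() -> list:
    # Stand-in for scanning papers; returns (hypothesis, single change) pairs.
    return [("lower lr helps", {"lr": 1e-4}), ("gelu beats relu", {"activation": "gelu"})]


def worker(planner: Planner, exp_id: int, train: Callable[[dict], float]) -> Experiment:
    exp = planner.queue.popleft()
    exp.branch = f"exp/{exp_id}"      # one branch per experiment
    exp.metric = train(exp.change)    # stand-in for submitting a training job
    return exp


def reporter(done: list) -> list:
    # Summary table as (branch, hypothesis, metric), lowest metric first.
    return sorted(((e.branch, e.hypothesis, e.metric) for e in done), key=lambda r: r[2])


def fake_train(change: dict) -> float:
    # Placeholder scores so the sketch runs offline; not real results.
    return 0.9 if "lr" in change else 0.8


planner = Planner()
for hypothesis, change in researcher():
    planner.enqueue(hypothesis, change)
done = [worker(planner, i, fake_train) for i in range(len(planner.queue))]
for row in reporter(done):
    print(row)
```

In a real setup each function would be a separate agent, and the queue and summary table would live as files in the shared Git repo rather than in memory.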

How do I create a Skill file for CUDA kernel writing?

A kernel-writing Skill file should contain: reference examples of working CUDA kernels for similar operations, benchmarking scripts that measure inference speedup, test scripts for correctness validation, and notes on hardware-specific patterns (e.g., H100 memory hierarchy). Store the Skill in the same repo as the kernel project. Source from maintained repos like the Hugging Face Skills repo rather than writing ad-hoc prompts. Include .toml metadata for hardware compatibility.

How do I use upskill to validate a Skill before deploying it?

Run the upskill library against your Skill to generate an evaluation. Upskill runs the Skill on multiple models, including cheaper and open-source alternatives, and reports accuracy and token usage for each. Compare the results to determine whether the Skill generalizes or is overfit to a single expensive model. This keeps you from broadly deploying a Skill that works well only on GPT-4 and fails on open models, which saves cost and improves reliability across your agent fleet.

How do I configure the .toml compatibility matrix for a CUDA kernel?

The .toml file specifies the kernel's hardware and software requirements: GPU generation (e.g., H100, A100), CUDA version, library versions, and any architecture-specific flags. Populate this before publishing the kernel to the Hub. Without it, the kernel may compile but silently fail or underperform on unintended hardware. Check existing kernels in the Hugging Face Kernels repo for .toml templates and conventions specific to your target GPU family.

// Troubleshooting

My coding agent keeps writing CUDA kernels that don't work on my GPU. What's wrong?

CUDA kernels are hardware-specific. The most common cause is a missing or incorrect compatibility matrix — the .toml file that specifies GPU generation and CUDA version. Ensure your Skill file includes reference kernels for your exact GPU. Also verify that the agent has access to the correct CUDA toolkit version and that the kernel targets the right compute capability (e.g., sm_90 for H100 vs sm_80 for A100). Loading a hardware-specific Skill converts the task from zero-shot guessing to guided generation.
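
One quick sanity check is to compare the architecture a kernel targets with the compute capability of the device it will run on. The sketch below is an assumption-laden helper, not part of any library: `parse_arch`, `targets_device`, and the `KNOWN_ARCHS` table are hypothetical names, and the exact-match rule is deliberately conservative. On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that could be passed in as `device_capability`.

```python
# Compute capabilities for the GPUs named in this FAQ.
KNOWN_ARCHS = {"A100": "sm_80", "RTX 4090": "sm_89", "H100": "sm_90"}


def parse_arch(arch: str) -> tuple:
    """Convert an nvcc-style arch string such as 'sm_90' into (major, minor)."""
    digits = arch[3:]
    if not arch.startswith("sm_") or not digits.isdigit() or len(digits) < 2:
        raise ValueError(f"unrecognized arch string: {arch!r}")
    # The last digit is the minor version; everything before it is the major.
    return int(digits[:-1]), int(digits[-1])


def targets_device(target_arch: str, device_capability: tuple) -> bool:
    """Conservative check: the kernel's target must match the device exactly."""
    return parse_arch(target_arch) == tuple(device_capability)


print(targets_device(KNOWN_ARCHS["H100"], (9, 0)))  # kernel built for H100, run on H100
print(targets_device(KNOWN_ARCHS["A100"], (9, 0)))  # kernel built for A100, run on H100
```

If the check fails, fix the target in the build configuration and the compatibility matrix before asking the agent to regenerate the kernel.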

My AutoLab agents are running duplicate experiments. How do I fix this?

The Planner agent should maintain a structured queue with deduplication logic, tracking which hypotheses have been tested, their hyperparameters, and their results. Because the Planner owns the queue, it should reject duplicate or stale ideas before they reach Workers. Ensure the shared Git repo state includes a scores data structure that all agents can query. If duplicates persist, add explicit deduplication checks to the Planner's prompt and use HF Jobs labels so the Reporter can filter runs programmatically.
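
One way to implement the deduplication check is to fingerprint each proposed experiment and refuse anything already seen. This is a minimal sketch under assumptions: the `ExperimentQueue` class and `fingerprint` helper are hypothetical names, and the in-memory `seen` dictionary stands in for a file kept in the shared Git repo.

```python
import hashlib
import json


def fingerprint(hypothesis: str, hyperparams: dict) -> str:
    """Stable ID for an experiment, independent of key order, case, and padding."""
    payload = json.dumps(
        {"hypothesis": hypothesis.strip().lower(), "hyperparams": hyperparams},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class ExperimentQueue:
    def __init__(self) -> None:
        self.seen = {}      # fingerprint -> hypothesis, for every accepted experiment
        self.pending = []   # experiments waiting for a Worker

    def propose(self, hypothesis: str, hyperparams: dict) -> bool:
        """Queue the experiment and return True, or return False if it is a duplicate."""
        key = fingerprint(hypothesis, hyperparams)
        if key in self.seen:
            return False
        self.seen[key] = hypothesis
        self.pending.append((key, hypothesis, hyperparams))
        return True


queue = ExperimentQueue()
print(queue.propose("Cosine schedule", {"lr": 3e-4, "warmup": 500}))   # new
print(queue.propose("cosine schedule ", {"warmup": 500, "lr": 3e-4}))  # same experiment
print(queue.propose("Cosine schedule", {"lr": 1e-4, "warmup": 500}))   # different lr
```

The fingerprint can double as a job label, which gives the Reporter a key for filtering runs that belong to the same experiment.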

My agent can't access the experiment dashboard data. What should I change?

Replace any dashboard backed by a proprietary API with one that exposes an open data layer. Trackio stores metrics as Parquet files that agents can query directly without UI mediation. If you're using a tool like Weights & Biases or MLflow, check whether agents can access the underlying data via file system or open API — any opaque layer the agent cannot get behind is a ceiling on its autonomous capabilities.

// Comparisons

How does Burtenshaw's framework compare to using LangChain or CrewAI for multi-agent systems?

LangChain and CrewAI provide general-purpose multi-agent orchestration frameworks. Burtenshaw's framework is domain-specific to AI systems engineering — it prescribes four concrete roles (Researcher, Planner, Workers, Reporter) tailored to ML experimentation, uses Git as shared state, and relies on verifiable experiment metrics as the coordination signal. The key philosophical difference is the insistence on open primitives: Parquet files, CLI tools, and file-based Skills rather than abstracted API chains. Generic frameworks may add opaque layers that limit agent autonomy in systems engineering contexts.

How does writing CUDA kernels with agents compare to using Triton or compiler-based approaches?

Triton and compiler-based approaches (like TorchInductor) auto-generate kernels from higher-level Python, trading peak performance for ease of use. Agent-written CUDA kernels can achieve hardware-specific optimizations that compiler heuristics miss, especially for non-standard operations or new GPU generations. The tradeoff: agent-written kernels require a Skill file with benchmarking scripts and a compatibility matrix, while Triton is more portable. Burtenshaw's approach is best when you need peak performance for a specific operation on a specific GPU and have verifiable benchmarks.

How is Burtenshaw's Skills approach different from just putting examples in the system prompt?

System prompt examples are static, unversioned, and typically maintained by prompt engineers disconnected from the project. Skills are file-based, versioned in the project repo, and maintained by the team that owns the domain. Agents can open and close Skill files as needed rather than carrying all context in a fixed prompt. Skills can include executable scripts (benchmarks, tests) — not just text — and can be evaluated with upskill across multiple models. This structure enables quality control, reuse, and cost optimization that ad-hoc prompt examples cannot support.

// Advanced

Can I use this framework with consumer GPUs or do I need H100s?

The framework applies to any GPU, but kernel optimization results are hardware-specific. Consumer GPUs (e.g., RTX 4090) have different memory hierarchies and compute capabilities than data center GPUs (H100, A100). You must define the correct hardware profile in the .toml compatibility matrix. The Boss 2 (fine-tuning) and Boss 3 (AutoLab) tiers can offload compute to HF Jobs on Hub-provisioned hardware, so your local GPU doesn't need to match the target. Skills should include reference kernels for your specific GPU generation.

What kinds of experiments work best with the AutoLab pattern?

AutoLab works best with verifiable experiments — those that produce a measurable, objective output metric like validation loss, bits-per-byte, inference latency, or accuracy. The experiments should be independent enough to run in parallel and implementable as single-change patches to a training script. Examples: learning rate schedule changes, activation function swaps, attention mechanism modifications, data augmentation strategies. Experiments that require subjective evaluation (e.g., 'does the output sound better?') are poor fits because agents cannot objectively score and rank them.

How do I scale AutoLab to run dozens of experiments in parallel?

Use HF Jobs for compute — each Worker agent submits experiments as Hub jobs that run on provisioned hardware. Store training scripts in an HF bucket so workers don't redundantly upload/download. Use HF Jobs labels to tag, sort, and filter runs programmatically. The Planner should batch experiments and the Reporter should monitor all concurrent runs via Trackio's Parquet data layer. Scale the number of Worker agents to match your compute budget. Git branches isolate each experiment's state.

What's the relationship between Hugging Face Hub and this framework?

Hugging Face Hub is the infrastructure backbone. Kernels are published as Hub repos with .toml compatibility metadata. Fine-tuning jobs run as HF Jobs on Hub-provisioned GPUs. AutoLab Workers submit experiment patches as HF Jobs. Trackio pushes metrics to Hub-hosted Parquet files. Skills are sourced from Hub-hosted repos. The framework treats the Hub as an open platform with inspectable primitives — file storage, CLI tools, job submission — rather than an opaque API, aligning with the 'primitives over abstracted APIs' principle.

How do I know if my Skill is good enough before using it with agents?

Use the upskill library. It generates an evaluation for your Skill, runs it on multiple models (including cheaper and open-source alternatives), and reports accuracy and token usage. A good Skill should produce consistent quality across models, not just on the most expensive one. If accuracy drops sharply on cheaper models, the Skill likely needs more examples or clearer structure. Also validate kernel Skills by running their included benchmarking scripts and checking that output kernels meet the target speedup.
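
Once you have per-model accuracy and token counts, the "drops sharply" check can be automated. The sketch below does not call upskill and does not reproduce its output format: the dictionary shape, the model names, the numbers, and the 0.10 threshold are all placeholders chosen for illustration.

```python
def flag_regressions(results: dict, baseline: str, max_drop: float = 0.10) -> list:
    """Return models whose accuracy falls more than max_drop below the baseline."""
    base_accuracy = results[baseline]["accuracy"]
    return sorted(
        model
        for model, report in results.items()
        if model != baseline and base_accuracy - report["accuracy"] > max_drop
    )


# Placeholder numbers, not measured results.
results = {
    "large-model": {"accuracy": 0.92, "tokens": 4200},
    "mid-model": {"accuracy": 0.88, "tokens": 3900},
    "small-open-model": {"accuracy": 0.61, "tokens": 6100},
}

flagged = flag_regressions(results, baseline="large-model")
print("needs more examples or clearer structure for:", flagged)
```

A non-empty list suggests the Skill leans on the strongest model's own knowledge rather than on the guidance in the file.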

Can I use this framework without Hugging Face tools?

The principles — Skills for few-shot guidance, open primitives, verifiable experiments, multi-agent distribution — are tool-agnostic. However, the specific workflow uses Hugging Face infrastructure: Hub for kernel repos, HF Jobs for compute, Trackio for metrics, upskill for Skill evaluation. You could substitute other tools if they meet the open primitives requirement — the agent must be able to inspect and query the underlying data layer. The key constraint is avoiding opaque APIs that create ceilings on agent autonomy.