Question 1

What does 'go closer to the silicon' mean for coding agents?

Accepted Answer

It means directing coding agents toward lower-level, hardware-specific engineering tasks — like writing CUDA kernels optimized for specific GPU generations — rather than high-level application code. The idea is that routine software tasks are commoditized by current agents, so the competitive frontier for AI engineers is in ML systems work that requires understanding GPU architecture, memory hierarchies, and inference optimization. This is where agents can still deliver differentiated value.

Question 2

What is the difference between Boss 1, Boss 2, and Boss 3 in this framework?

Accepted Answer

Boss 1 is CUDA kernel writing with interactive agent guidance — the lowest autonomy tier. Boss 2 is zero-shot LLM fine-tuning where the agent handles the full pipeline from a plain-language instruction. Boss 3 is AutoLab, a fully autonomous multi-agent research pipeline. Each Boss increases in complexity and agent autonomy. You should start at the lowest applicable tier for your problem and only escalate when the task demands distributed, parallel experimentation.

Question 3

What is arithmetic intensity and why does it matter for GPU kernels?

Accepted Answer

Arithmetic intensity is the ratio of computation performed to data moved in a GPU kernel, usually expressed as floating-point operations (FLOPs) per byte of memory traffic. Higher arithmetic intensity means the GPU spends more time computing and less time waiting for data transfers. In Burtenshaw's framework, custom kernels increase arithmetic intensity to address the memory bandwidth bottleneck: they do more math per byte read or written to 'keep the GPUs warm.' For memory-bound inference workloads, this is the primary optimization lever for speedup, and it is more impactful than adding raw compute.
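
To make the ratio concrete, here is a minimal Python sketch that computes arithmetic intensity for two common operations and applies the standard roofline bound. The function names and the PEAK_FLOPS / PEAK_BANDWIDTH values are assumptions chosen for illustration, not figures from the framework or the specification of any real GPU. The matmul estimate also assumes each matrix crosses the memory bus exactly once, which real kernels only approximate.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte read from or written to memory."""
    return flops / bytes_moved


def elementwise_add_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """c[i] = a[i] + b[i]: one FLOP per element, two reads and one write."""
    return arithmetic_intensity(n, 3 * n * bytes_per_elem)


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """C = A @ B, assuming each matrix crosses the memory bus exactly once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return arithmetic_intensity(flops, bytes_moved)


def attainable_flops(intensity: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Roofline bound: limited by compute or by bandwidth, whichever is lower."""
    return min(peak_flops, intensity * peak_bandwidth)


# Illustrative hardware limits only, not the spec of any real GPU.
PEAK_FLOPS = 100e12      # FLOP/s
PEAK_BANDWIDTH = 1e12    # bytes/s

if __name__ == "__main__":
    cases = [
        ("elementwise add", elementwise_add_intensity(1 << 20)),
        ("4096^3 matmul", matmul_intensity(4096, 4096, 4096)),
    ]
    for name, ai in cases:
        bound = attainable_flops(ai, PEAK_FLOPS, PEAK_BANDWIDTH)
        kind = "memory-bound" if bound < PEAK_FLOPS else "compute-bound"
        print(f"{name}: {ai:.2f} FLOP/byte, {kind}")
```

Under these assumed limits the elementwise add sits far below the point where compute becomes the limit, which is why fusing several such operations into one kernel (fewer bytes moved for the same math) tends to help.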

Question 4

How do I create a good Skill file for a coding agent?

Accepted Answer

A good Skill file should contain three components: reference examples of working outputs (e.g., completed kernels or training scripts), benchmarking or test scripts the agent can run to validate its work, and structured metadata about the task domain. Store Skills in version-controlled repos maintained by the team that owns the relevant project. Avoid ad-hoc prompt files — they degrade quickly. Use the upskill library to evaluate Skill quality across multiple models before deploying at scale.

Question 5

How do I set up the four AutoLab agent roles?

Accepted Answer

Define four agents: the Researcher scans papers via HF Papers CLI or arXiv and outputs hypotheses; the Planner maintains a structured experiment queue tracking hyperparameters and past results; Workers pick up queued jobs, implement them as patches to the training script, and submit HF Jobs; the Reporter monitors all runs via Trackio, flags anomalies, and produces summary tables. Use a Git repo as shared state — main branch holds the current best training script and a scores data structure, while each experiment runs on its own branch.
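
One way to picture the shared state is the small Python sketch below. It is an assumption-laden illustration, not AutoLab's actual code: the Experiment and ExperimentQueue classes, the status values, and the exp/<id> branch naming are all hypothetical, and a real setup would persist the queue as a file committed to the repo rather than hold it in memory.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Experiment:
    exp_id: str
    hypothesis: str                 # written by the Researcher
    hyperparams: dict
    status: str = "queued"          # queued -> running -> done | failed
    score: Optional[float] = None   # filled in once the run is evaluated

    @property
    def branch(self) -> str:
        """Each experiment lives on its own Git branch (naming is hypothetical)."""
        return f"exp/{self.exp_id}"


@dataclass
class ExperimentQueue:
    """Planner-owned queue of experiments, oldest first."""
    experiments: list = field(default_factory=list)

    def add(self, exp: Experiment) -> None:
        self.experiments.append(exp)

    def claim_next(self) -> Optional[Experiment]:
        """A Worker takes the oldest queued experiment and marks it running."""
        for exp in self.experiments:
            if exp.status == "queued":
                exp.status = "running"
                return exp
        return None

    def record(self, exp_id: str, score: Optional[float]) -> None:
        """Store the outcome; a missing score marks the run as failed."""
        for exp in self.experiments:
            if exp.exp_id == exp_id:
                exp.score = score
                exp.status = "done" if score is not None else "failed"
                return
        raise KeyError(exp_id)

    def summary(self) -> list:
        """Reporter view: finished runs, best (lowest) score first."""
        done = [e for e in self.experiments if e.status == "done"]
        return sorted(done, key=lambda e: e.score)
```

The sketch assumes a lower-is-better metric such as validation loss; flip the sort in summary() for metrics where higher is better.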

Question 6

How do I validate that a CUDA kernel generated by an agent actually works on my hardware?

Accepted Answer

First, check the .toml compatibility file to confirm the kernel targets your exact GPU generation and CUDA version. Then run the benchmarking scripts included in the kernel Skill file on your hardware. Confirm that the kernel's outputs match those of the baseline (unoptimized) kernel, then compare its performance against that baseline. The Hugging Face Kernels library provides a structured format for this: each kernel repo includes compatibility metadata and test harnesses. Never deploy a kernel that hasn't been benchmarked on the target hardware.
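
The check-then-benchmark order can be sketched in plain Python. This is a generic harness written for illustration, not the test harness that ships with any kernel repo: validate, median_runtime, and the three *_scale functions are hypothetical names, and the "kernels" are CPU stand-ins so the example runs anywhere. When timing real GPU code, the device must be synchronized before the clock is stopped, otherwise only the launch time is measured.

```python
import math
import time
from statistics import median


def median_runtime(fn, args, warmup=3, repeats=10):
    """Median wall-clock seconds per call, after a few warm-up calls."""
    for _ in range(warmup):
        fn(*args)
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)  # for real GPU work, synchronize the device before stopping the clock
        timings.append(time.perf_counter() - start)
    return median(timings)


def validate(candidate, baseline, args, rel_tol=1e-6, abs_tol=1e-9):
    """Check numerical agreement first, then measure speedup over the baseline."""
    expected = baseline(*args)
    actual = candidate(*args)
    correct = len(actual) == len(expected) and all(
        math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )
    report = {"correct": correct, "speedup": None}
    if correct:  # a fast but wrong kernel is not worth timing
        base_s = median_runtime(baseline, args)
        cand_s = median_runtime(candidate, args)
        report["speedup"] = base_s / cand_s if cand_s > 0 else float("inf")
    return report


# CPU stand-ins for the unoptimized kernel and the agent-generated kernel.
def baseline_scale(xs, a):
    out = []
    for x in xs:
        out.append(a * x)
    return out


def candidate_scale(xs, a):
    return [a * x for x in xs]


def broken_scale(xs, a):
    return [a * x + 1.0 for x in xs]  # deliberately wrong, to show rejection
```

Calling validate(candidate_scale, baseline_scale, (data, 2.0)) returns a report with correct set to True and a measured speedup, while broken_scale is rejected before any timing is done.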

Question 7

How do I use upskill to evaluate a Skill before deploying it?

Accepted Answer

The upskill library generates an evaluation for a given Skill, runs it across multiple models (including cheaper or open-source alternatives), and reports accuracy and token usage for each. This lets you determine whether a Skill generalizes beyond the model it was originally designed for and whether you can swap to a more cost-effective model without degrading quality. Run upskill before scaling any Skill to production to avoid unnecessary cost and inconsistent agent performance.

Question 8

What if my coding agent can't access the data layer behind my experiment dashboard?

Accepted Answer

This is a known ceiling in the framework. Any dashboard or tool backed by a proprietary API that hides the underlying data from the agent limits what the agent can accomplish. Switch to a tool with an open data layer — the framework recommends Trackio, which stores all metrics as Parquet files that agents can query directly without UI mediation. If you must use a proprietary dashboard, build an adapter that exposes the data as files or a CLI the agent can call.

Question 9

My agent keeps generating kernels that don't match my GPU — what's wrong?

Accepted Answer

You likely skipped the compatibility matrix step. CUDA kernels are hardware-specific: a kernel optimized for H100 may be unusable on A100 or consumer GPUs, and the mismatch is not always obvious until the kernel is actually run. Always populate the .toml configuration file with your exact GPU generation, CUDA version, and library versions before the agent starts generating. Load a Skill file that includes reference kernels for your specific hardware pairing. Check the Hugging Face Kernels Hub for pre-existing kernels targeting your GPU before writing from scratch.

Question 10

What if AutoLab agents start running duplicate or low-value experiments?

Accepted Answer

This happens when the Planner does not have adequate deduplication logic or when the Researcher generates hypotheses without checking past results. The framework addresses this by having the Reporter/Reviewer agent reject duplicates or stale ideas before they reach Workers. Ensure your Planner maintains a structured queue that tracks all past experiments and their results. Use the scores data structure on the main branch as the single source of truth for what has already been tried.
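
A simple deduplication mechanism is to fingerprint each proposed configuration and refuse any fingerprint already on record. The sketch below is one possible approach, not the framework's implementation: fingerprint and ExperimentLedger are hypothetical names, and the ledger is held in memory here, whereas in AutoLab the record of past experiments would live in the scores data structure on the main branch.

```python
import hashlib
import json


def fingerprint(config: dict) -> str:
    """Stable hash of a configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExperimentLedger:
    """Tracks every configuration that has been proposed so far."""

    def __init__(self):
        self._seen = {}

    def propose(self, config: dict) -> bool:
        """Return True and record the config if it is new, False if it is a duplicate."""
        key = fingerprint(config)
        if key in self._seen:
            return False
        self._seen[key] = config
        return True


if __name__ == "__main__":
    ledger = ExperimentLedger()
    print(ledger.propose({"lr": 0.001, "schedule": "cosine"}))  # new
    print(ledger.propose({"schedule": "cosine", "lr": 1e-3}))   # same config, reordered
```

Exact hashing only catches identical configurations. Note that an integer 1 and a float 1.0 serialize differently and so count as distinct, and near-duplicates (a learning rate of 0.001 versus 0.0011) still need a judgment call from the Planner or a reviewing agent.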

Question 11

How does this framework compare to just prompting ChatGPT or Claude to write CUDA kernels?

Accepted Answer

Prompting a general-purpose LLM for CUDA kernels is zero-shot — the model has no project-specific context, benchmarking scripts, or hardware compatibility constraints. Burtenshaw's framework converts this to few-shot by loading a Skill file with reference kernels, test harnesses, and .toml metadata. The agent also runs in a loop with the ability to benchmark and iterate, whereas a single ChatGPT prompt produces a static output with no hardware validation. The framework targets a specific GPU generation; ad-hoc prompting does not.

Question 12

How does AutoLab compare to existing experiment tracking tools like Weights & Biases?

Accepted Answer

The key difference is data layer openness. Weights & Biases uses a proprietary API that agents cannot easily query or inspect programmatically without going through the W&B SDK. AutoLab uses Trackio, which stores metrics as Parquet files — fully open and directly accessible to agents. AutoLab also includes the full research loop (hypothesis generation, experiment queuing, execution, reporting), whereas W&B is primarily a tracking and visualization tool that does not drive the research process itself.

Question 13

How does using Skills compare to using RAG for giving agents context?

Accepted Answer

Skills are curated, file-based context documents maintained by project owners — they provide targeted examples, scripts, and references for a specific task. RAG retrieves chunks from a broader knowledge base based on similarity search, which can introduce noise or miss critical context. Skills offer more reliable few-shot guidance because they are hand-curated and task-specific, whereas RAG is better for broad knowledge retrieval. For engineering tasks requiring precision (like kernel writing), Skills outperform generic RAG.

Question 14

Can I use this framework with consumer GPUs instead of H100s or A100s?

Accepted Answer

Yes, but the compatibility matrix becomes even more critical. Consumer GPUs have different memory bandwidths, CUDA core counts, and supported CUDA versions than data center GPUs. You must specify your exact consumer GPU in the .toml file, and the Skill file should include reference kernels tested on similar hardware. Expect different speedup baselines — a 94% speedup on H100 does not translate directly to consumer hardware. Check the Kernels Hub for existing consumer-GPU-compatible kernels before generating new ones.

Question 15

What types of experiments work best with the AutoLab pattern?

Accepted Answer

Experiments that are verifiable, independent, and produce a single measurable metric work best. Examples include: comparing learning rate schedules measured by validation loss, testing activation function swaps measured by bits-per-byte, or evaluating architecture modifications measured by inference latency. If experiments are tightly coupled (where one depends on another's output), AutoLab's parallelism offers less benefit. The framework's core principle is that verifiable experiments are the foundation — without an objective scoring metric, autonomous agents cannot rank and select improvements.
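
The reason a single metric matters is that it makes selection mechanical. A minimal sketch, assuming each experiment reports one number and that the baseline score is known; rank_improvements is a hypothetical helper and the scores are made up for illustration, not results from any real run.

```python
def rank_improvements(baseline: float, results: dict, lower_is_better: bool = True) -> list:
    """Return (name, score) pairs that beat the baseline, best first."""
    if lower_is_better:
        winners = [(name, score) for name, score in results.items() if score < baseline]
        return sorted(winners, key=lambda pair: pair[1])
    winners = [(name, score) for name, score in results.items() if score > baseline]
    return sorted(winners, key=lambda pair: pair[1], reverse=True)


# Made-up validation losses keyed by experiment branch, for illustration only.
EXAMPLE_RESULTS = {
    "exp/cosine-lr": 1.92,
    "exp/gelu-swap": 2.05,
    "exp/wider-mlp": 1.88,
}

if __name__ == "__main__":
    for name, score in rank_improvements(2.00, EXAMPLE_RESULTS):
        print(f"{name}: {score}")
```

Without an objective score there is nothing to sort on, so the loop cannot decide which branch deserves to become the new best training script.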

Question 16

How do I scale AutoLab beyond a single research question?

Accepted Answer

Maintain multiple Git branches and experiment queues, each addressing a different research question or model architecture. The Planner agent can manage multiple queues simultaneously, and Workers can pull from any active queue. Use Trackio labels and HF Jobs tags to separate experiment families so the Reporter can generate per-question dashboards. The Parquet data layer supports arbitrary filtering, so you can slice results by research question, model, or date range without restructuring your infrastructure.

Question 17

What is the 'open primitives over abstracted APIs' principle in practice?

Accepted Answer

In practice, this means choosing tools where the agent can directly access the underlying data and control mechanisms — file systems, CLI tools, Parquet stores, Git repos — rather than tools that force interaction through high-level SDKs or proprietary APIs. For example, use Trackio (Parquet-backed) instead of a dashboard with a REST-only API. Use Git for shared state instead of a managed project management tool. Any abstraction the agent cannot get behind becomes a hard ceiling on its capabilities.

Question 18

Do I need to know CUDA programming to use Boss 1?

Accepted Answer

You do not need to write CUDA from scratch, but you need enough understanding to evaluate the agent's output and interpret benchmarking results. The Skill file provides reference examples and test scripts that guide the agent, and the .toml compatibility check validates hardware targeting. However, you should understand concepts like arithmetic intensity, memory bandwidth bottlenecks, and thread block sizing to effectively guide the agent interactively and catch errors in generated kernels.

Question 19

What happens if a Worker agent submits a training job that fails on HF Jobs?

Accepted Answer

The Reporter agent monitors all running jobs and should detect failures through Trackio's events and notification layer. When a failure is detected, the Reporter flags it, and the Planner can either retry the job with modified parameters or remove it from the queue. Store failure logs in the shared Git repo so the Researcher can avoid proposing similar hypotheses in future iterations. The framework's emphasis on open data layers means failure information is always accessible to all agents.

Question 20

Can I use this framework with models not hosted on Hugging Face Hub?

Accepted Answer

The framework is designed around Hugging Face infrastructure — Hub repos, HF Jobs, HF CLI, and the Kernels library. You can adapt it for other platforms, but you would need to replace the Hub-specific components with equivalents that maintain open primitives. The core principles (Skills for few-shot context, open data layers, verifiable experiments, multi-agent distribution) are platform-agnostic, but the specific workflow steps and tooling assume HF Hub as the backbone.

Question 21

How long does a typical AutoLab run take?

Accepted Answer

AutoLab runs are designed to execute overnight or over multiple hours, with Workers running experiments in parallel on HF Jobs. The actual duration depends on experiment complexity — a batch of hyperparameter sweeps on a small model might complete in a few hours, while architecture modifications on larger models could take a full day. The Gantt chart generated from Trackio's Parquet data layer helps visualize agent timelines and identify bottlenecks in the pipeline.

Question 22

Why shouldn't I use ad-hoc prompts instead of maintained Skill files?

Accepted Answer

Ad-hoc prompts degrade quickly as models, APIs, and project structures evolve. They lack versioning, benchmarking scripts, and reference examples, so agents operate in zero-shot mode with unreliable guidance. Maintained Skill files are version-controlled, include test harnesses, and are updated by the team that owns the relevant project. The upskill library can evaluate Skill quality across models, which is impossible with ad-hoc prompts. Unmaintained Skills lead to unnecessary cost and inconsistent agent output.

Frequently Asked Questions About Burtenshaw AI Systems Engineering via Coding Agents
