How AI Startup Founders Use Coding Agents to Ship Models Faster

For AI startup founders and technical leads · Based on Burtenshaw AI Systems Engineering via Coding Agents

// TL;DR

AI startup founders with small teams can use Burtenshaw's framework to do the work of a larger ML engineering org. Boss 2 (zero-shot fine-tuning) lets you fine-tune models with plain-language instructions — the agent handles script generation, job submission, and GPU provisioning. Boss 1 (kernel optimization) reduces inference costs on your serving hardware. Boss 3 (AutoLab) runs parallel experiments overnight to improve model quality autonomously. The framework's emphasis on open primitives and file-based Skills ensures agents stay effective without requiring a large team to manage prompt infrastructure.

Why should AI startup founders care about coding agents for ML engineering?

Small AI startups face a fundamental resource constraint: they need to fine-tune models, optimize inference, and iterate on model quality — work that traditionally requires dedicated ML engineers, infrastructure engineers, and researchers. Burtenshaw's framework lets coding agents perform much of this work with minimal human oversight.

The three tiers map directly to startup needs:

- Boss 1 (CUDA kernels): Reduce inference costs by optimizing for your specific serving GPU

- Boss 2 (Fine-tuning): Customize models for your domain without writing training code

- Boss 3 (AutoLab): Discover model improvements by running experiments overnight

The key advantage for startups is that Skills — file-based context documents — compound over time. As your team creates and maintains Skills, every future agent task becomes more reliable and efficient.

How do you fine-tune a model without a dedicated ML engineer?

Boss 2 is designed for exactly this scenario. Load HF CLI skills and issue a plain-language instruction: "Fine-tune Qwen 3 8B on our customer support dataset." The coding agent generates the training script, submits it as an HF Job, and handles GPU provisioning on the Hub.

After training, use upskill to evaluate whether the resulting model holds up on cheaper inference options. If the fine-tuned model performs well on a cheaper serving setup, you can switch to it and avoid being locked into expensive API calls.

The critical step many founders skip: use upskill to validate your Skills before scaling. A Skill that works with one model may not work with another, so deploying it broadly without evaluating it across models risks unnecessary cost and degraded quality.
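The switching decision described above comes down to picking the cheapest option that still clears your quality bar. Here is a minimal sketch of that rule, assuming you have already collected a pass rate and a cost figure for each option. It does not call upskill or reproduce its output format; `EvalResult`, `cheapest_passing`, the option names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    name: str                     # hypothetical label for a model/serving option
    pass_rate: float              # fraction of eval cases passed, 0.0 to 1.0
    cost_per_1k_requests: float   # illustrative cost in USD


def cheapest_passing(results, min_pass_rate):
    """Return the lowest-cost option whose pass rate meets the bar, or None."""
    passing = [r for r in results if r.pass_rate >= min_pass_rate]
    if not passing:
        return None
    return min(passing, key=lambda r: r.cost_per_1k_requests)


# Illustrative numbers only; replace them with your own evaluation results.
results = [
    EvalResult("hosted-api-large", 0.95, 12.00),
    EvalResult("finetuned-8b-a100", 0.93, 3.50),
    EvalResult("finetuned-8b-quantized", 0.86, 1.20),
]

choice = cheapest_passing(results, min_pass_rate=0.90)
print(choice.name if choice else "no option meets the bar")
```

Fixing the quality bar before looking at prices keeps the decision honest: the cheapest option only wins if it passes.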

How do you reduce inference costs with agent-written kernels?

If you're serving a model on specific GPUs (say, A100s on your cloud provider), the model likely wasn't optimized for that exact hardware. Start by checking the Hugging Face Kernels Hub for existing compatible kernels. Reusing one is the lowest-effort optimization and often the highest-impact.

If no kernel exists, load a kernel-writing Skill and have the agent generate one. The Skill provides reference examples and benchmarking scripts so the agent doesn't start from scratch. Configure the `.toml` file with your hardware profile, then benchmark the generated kernel against your current baseline on the target GPU. A 94% speedup on hardware-specific pairings is a realistic target, but only a measurement on your own workload confirms it.
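A speedup figure only means something if you know how it was measured. The sketch below is a generic timing harness, not the benchmarking script that ships with any Skill. `baseline` and `candidate` are hypothetical CPU stand-ins for the existing and agent-written kernels. It reports the result two ways, because a percentage alone can be read either as a ratio or as time saved.

```python
import math
import statistics
import time


def median_seconds(fn, *args, warmup=3, repeats=20):
    """Median wall-clock time of fn(*args) after a few warm-up calls."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup_report(baseline_s, candidate_s):
    """Express one comparison two ways so the figure is unambiguous."""
    return {
        "speedup_x": baseline_s / candidate_s,  # 2.0 means twice as fast
        "time_saved_pct": (1 - candidate_s / baseline_s) * 100,
    }


# Hypothetical stand-ins for the existing op and the optimized one.
def baseline(xs):
    total = 0.0
    for x in xs:
        total += x * x
    return total


def candidate(xs):
    return sum(x * x for x in xs)


data = [float(i) for i in range(10_000)]

# Check that both versions agree before trusting any timing.
assert math.isclose(baseline(data), candidate(data))

report = speedup_report(median_seconds(baseline, data), median_seconds(candidate, data))
print(report)
```

On a GPU, kernel launches are asynchronous, so a real harness has to synchronize the device before reading the timer (for example with `torch.cuda.synchronize()` in PyTorch). Otherwise it measures launch overhead rather than kernel time.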

The principle to internalize: memory bandwidth is the bottleneck, not compute. Agent-written kernels should increase arithmetic intensity — more math per memory read/write — rather than maximizing raw FLOPs.
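To make arithmetic intensity concrete, here is a small worked example using the standard roofline model. It compares computing `max(a * x + b, 0)` over a tensor as three separate kernels against one fused kernel. The hardware figures are round illustrative numbers, not the spec of any particular GPU, and the operation is a hypothetical example.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte read from or written to memory."""
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


N = 1_000_000            # elements in the tensor
BYTES_PER_ELEMENT = 2    # fp16
FLOPS_PER_ELEMENT = 3    # multiply, add, max

flops = FLOPS_PER_ELEMENT * N

# Unfused: three kernels, each reads N elements and writes N elements.
unfused_bytes = 3 * 2 * N * BYTES_PER_ELEMENT
# Fused: one kernel reads N elements once and writes N elements once.
fused_bytes = 2 * N * BYTES_PER_ELEMENT

unfused_ai = arithmetic_intensity(flops, unfused_bytes)
fused_ai = arithmetic_intensity(flops, fused_bytes)

# Illustrative round numbers, assumed for this example only.
PEAK_FLOPS = 300e12      # 300 TFLOP/s
BANDWIDTH = 1.5e12       # 1.5 TB/s

unfused_perf = attainable_flops(unfused_ai, PEAK_FLOPS, BANDWIDTH)
fused_perf = attainable_flops(fused_ai, PEAK_FLOPS, BANDWIDTH)

print(f"unfused: {unfused_ai:.2f} FLOP/byte, fused: {fused_ai:.2f} FLOP/byte")
print(f"modeled gain from fusion: {fused_perf / unfused_perf:.1f}x")
```

Both versions perform the same number of FLOPs. The fused one is faster in this model only because it moves a third of the bytes, and both sit far below the compute ceiling. That is the sense in which memory bandwidth, not compute, is the bottleneck.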

How do you use AutoLab to improve model quality overnight?

Once you have a working fine-tuned model and a verifiable metric (accuracy on your eval set, latency, or a user-satisfaction proxy), set up AutoLab to explore improvements autonomously.

The four agents — Researcher, Planner, Workers, Reporter — propose hypotheses from recent papers, queue single-change experiments, run them as HF Jobs, and track results in Trackio. You review results in the morning and promote winning changes.
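Single-change experiments matter because they let you attribute a win to one cause. The following sketch shows how a queue of such experiments could be generated and how winners could be picked for promotion. It is not AutoLab's implementation; the function names, hyperparameters, and scores are all hypothetical.

```python
def single_change_variants(baseline, search_space):
    """Yield (description, config) pairs that differ from baseline in exactly one key."""
    for key, values in search_space.items():
        for value in values:
            if baseline.get(key) != value:
                yield f"{key}={value}", {**baseline, key: value}


def winners(baseline_score, scores, min_gain=0.0):
    """Experiments beating the baseline by more than min_gain, best first.

    Assumes a higher score is better.
    """
    better = [(name, s) for name, s in scores.items() if s - baseline_score > min_gain]
    return sorted(better, key=lambda item: item[1], reverse=True)


# Hypothetical baseline config and search space.
baseline = {"learning_rate": 2e-5, "lora_rank": 16, "epochs": 3}
search_space = {"learning_rate": [1e-5, 2e-5, 5e-5], "lora_rank": [8, 16, 32]}

queue = list(single_change_variants(baseline, search_space))
for description, _config in queue:
    print("queued:", description)

# Hypothetical eval scores, standing in for what you would read from your tracker.
scores = {
    "learning_rate=1e-05": 0.79,
    "learning_rate=5e-05": 0.83,
    "lora_rank=8": 0.80,
    "lora_rank=32": 0.81,
}
print(winners(baseline_score=0.80, scores=scores, min_gain=0.005))
```

The `min_gain` threshold is a simple guard against promoting changes whose improvement is within run-to-run noise. Set it from the variance you observe when repeating the baseline.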

For startups, the Planner's deduplication logic is crucial. Without it, agents waste compute re-running the same experiments. Use HF Jobs labels so your Reporter can track costs per experiment — essential for a budget-constrained team.
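One simple way to deduplicate is to fingerprint each experiment config and refuse any config whose fingerprint has been seen before. The sketch below illustrates that idea under the assumption that configs are JSON-serializable dictionaries. It is not the Planner's actual logic, and `fingerprint` and `ExperimentQueue` are hypothetical names.

```python
import hashlib
import json


def fingerprint(config):
    """Stable hash of a config; key order does not affect the result."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExperimentQueue:
    """Queue that silently drops configs it has already accepted."""

    def __init__(self):
        self._seen = set()
        self.pending = []

    def submit(self, config):
        """Queue the config unless an identical one was submitted. Returns True if queued."""
        key = fingerprint(config)
        if key in self._seen:
            return False
        self._seen.add(key)
        self.pending.append(config)
        return True


queue = ExperimentQueue()
print(queue.submit({"learning_rate": 5e-5, "lora_rank": 16}))  # new config
print(queue.submit({"lora_rank": 16, "learning_rate": 5e-5}))  # same config, reordered keys
print(len(queue.pending))
```

In a long-running setup the seen set would need to be persisted, for example alongside your experiment logs, so that a restarted agent does not forget what has already run.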

The non-negotiable requirement: use tools with open data layers. Trackio stores metrics as Parquet files that agents can query directly. Any tool that hides data behind a proprietary API limits what your agents can do autonomously.

Next step: Start with Boss 2 — fine-tune your first model using a plain-language instruction and HF CLI skills. Once the model is serving, move to Boss 1 to optimize inference costs on your target GPU. Graduate to Boss 3 when you have a clear verifiable metric and want to explore improvements at scale.

// FREQUENTLY ASKED QUESTIONS

How much ML experience do I need to use this framework as a founder?

You need enough ML understanding to define your problem (what model, what dataset, what metric), select the right Boss tier, and evaluate agent output. The framework offloads script generation, job submission, and experiment management to agents. Deep CUDA or training expertise is not required for Boss 2 (fine-tuning) but helps for Boss 1 (kernels). Skills files reduce the expertise bar by providing reference examples and benchmarking scripts.

What does this cost to run for a small startup?

Costs depend on compute usage. Boss 2 fine-tuning runs on HF Jobs — you pay for the GPU hours. Boss 1 kernel optimization is a one-time investment that reduces ongoing inference costs. Boss 3 AutoLab costs scale with the number of parallel experiments. Use upskill to find cheaper model alternatives and HF Jobs labels to track per-experiment costs. The framework's emphasis on open primitives means you're not locked into expensive proprietary platforms.
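As a rough illustration of per-experiment cost tracking, the sketch below sums estimated spend by label from a list of job records. The record shape, the flavor names, and the hourly rates are all hypothetical. It does not call the HF Jobs API, so you would map whatever fields your job listing returns onto this structure.

```python
from collections import defaultdict


def cost_by_label(jobs, hourly_rates, label_key="experiment"):
    """Sum estimated spend per label value from job runtimes and hourly rates."""
    totals = defaultdict(float)
    for job in jobs:
        hours = job["runtime_seconds"] / 3600
        label = job["labels"].get(label_key, "unlabeled")
        totals[label] += hours * hourly_rates[job["flavor"]]
    return dict(totals)


# Hypothetical job records and prices, for illustration only.
jobs = [
    {"labels": {"experiment": "lr-sweep"}, "flavor": "gpu-large", "runtime_seconds": 7200},
    {"labels": {"experiment": "lr-sweep"}, "flavor": "gpu-large", "runtime_seconds": 3600},
    {"labels": {"experiment": "rank-sweep"}, "flavor": "gpu-small", "runtime_seconds": 1800},
    {"labels": {}, "flavor": "gpu-small", "runtime_seconds": 3600},
]
hourly_rates = {"gpu-large": 4.0, "gpu-small": 1.0}  # USD per hour, assumed

for label, usd in sorted(cost_by_label(jobs, hourly_rates).items()):
    print(f"{label}: ${usd:.2f}")
```

Reporting an explicit "unlabeled" bucket makes untracked spend visible, which is usually the first thing a budget-constrained team needs to find.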

Can I use this framework if my product doesn't use Hugging Face models?

The principles (Skills for few-shot guidance, open primitives, verifiable experiments, multi-agent teams) are model-agnostic. However, the workflow is tightly integrated with Hugging Face infrastructure (Hub, HF Jobs, Trackio, Kernels Hub). If you use non-HF models, you can still apply the principles, but you will need to substitute equivalent tools that expose open data layers and avoid opaque APIs.