How Academic ML Researchers Run Autonomous Experiments with AutoLab
For applied ML researchers at academic labs · Based on Burtenshaw AI Systems Engineering via Coding Agents
// TL;DR
Academic ML researchers can use the AutoLab pattern (Boss 3) to autonomously discover training improvements by distributing the research process across four agent roles: Researcher, Planner, Workers, and Reporter. Define a verifiable metric (bits-per-byte, validation loss), set up a Git repo as shared state, and let agents scan papers, propose hypotheses, implement single-change experiments as training script patches, and submit them as parallel HF Jobs. Trackio stores all metrics as Parquet files, enabling researchers to query results programmatically. AutoLab works best when experiments are independent and verifiable.
Why should academic researchers use multi-agent systems for ML experiments?
ML research involves a large search space: hyperparameter configurations, architecture variations, training schedules, and data augmentation strategies. A single researcher or a single iterating agent can only explore this space sequentially. AutoLab distributes the research process across four specialized agent roles (Researcher, Planner, Workers, Reporter) that operate in parallel, proposing and running experiments overnight while the human researcher is away.
The critical requirement is a verifiable experiment: one with a measurable output metric (bits-per-byte, validation loss, inference latency) that agents can objectively score and rank. If your research question reduces to such a number, AutoLab can explore it autonomously.
How do you architect the AutoLab agent team?
Define four agent roles with clear responsibilities:
- Researcher: Scans recent papers via HF Papers CLI or arXiv, formulates improvement hypotheses based on the current training setup.
- Planner: Receives hypotheses and maintains a structured experiment queue. Tracks current hyperparameters, past results, and which ideas are worth pursuing. Rejects duplicates or stale ideas.
- Workers: Pick up queued experiments, implement each as a single-change patch to the training script (e.g., swap activation function, adjust learning rate schedule), and submit as HF Jobs.
- Reporter: Monitors all running jobs via Trackio, maintains the metrics dashboard, flags anomalies, and produces summary tables.
Use a Git repo as the shared state: the main branch holds the current best training script and a scores data structure that records past experiments and their results. Each experiment runs on its own branch. When an experiment outperforms the current best, its patch is promoted to main.
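The promotion rule above can be sketched in a few lines. This is a minimal illustration, not AutoLab's actual format: the `ExperimentResult` and `Scores` classes, their field names, and the branch names are all hypothetical, and the sketch assumes a lower-is-better metric such as validation bits-per-byte.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentResult:
    branch: str    # hypothetical branch name, e.g. "exp/cosine-lr"
    change: str    # one-line description of the single change tested
    metric: float  # assumed lower-is-better, e.g. validation bits-per-byte


@dataclass
class Scores:
    """Hypothetical 'scores' record kept next to the training script on main."""

    best_branch: str
    best_metric: float
    history: list[ExperimentResult] = field(default_factory=list)

    def record(self, result: ExperimentResult) -> bool:
        """Log a result; return True if its patch should be promoted to main."""
        self.history.append(result)
        if result.metric < self.best_metric:
            self.best_branch = result.branch
            self.best_metric = result.metric
            return True
        return False


# Illustrative values only; these are not real results.
scores = Scores(best_branch="main", best_metric=1.00)
promote = scores.record(ExperimentResult("exp/cosine-lr", "cosine LR schedule", 0.98))
```

Keeping the full history, and not only the current best, is what lets the Planner later check whether an idea has already been tried.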
How do you ensure experiments are valid and not wasted compute?
Three safeguards apply. First, every experiment must be a single-change modification, which isolates the variable being tested. Second, the Planner deduplicates against the scores data structure to prevent re-running past experiments. Third, Trackio's open Parquet data layer lets you (or the Reporter agent) query all historical results programmatically, generate Gantt charts to visualize timelines, and spot anomalies early.
Before scaling, use the upskill library to validate any Skills used in the pipeline: it evaluates Skill quality across multiple models and reports accuracy/token trade-offs.
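The first two safeguards can be approximated with a simple check, sketched here under a strong assumption: each experiment is described by a flat configuration dictionary. Real patches to a training script would need a diff-level check instead. The function names `changed_keys`, `fingerprint`, and `accept` are hypothetical.

```python
import hashlib
import json

_MISSING = object()  # sentinel so a missing key differs from a key set to None


def changed_keys(base: dict, candidate: dict) -> set[str]:
    """Return the config keys whose values differ between base and candidate."""
    keys = base.keys() | candidate.keys()
    return {k for k in keys if base.get(k, _MISSING) != candidate.get(k, _MISSING)}


def fingerprint(config: dict) -> str:
    """Stable hash of a config, independent of key order."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def accept(base: dict, candidate: dict, seen: set[str]) -> bool:
    """Accept a candidate only if it changes exactly one key and is not a repeat."""
    if len(changed_keys(base, candidate)) != 1:
        return False
    fp = fingerprint(candidate)
    if fp in seen:
        return False
    seen.add(fp)
    return True


# Illustrative configs; the keys are made up for this example.
base = {"lr": 3e-4, "activation": "relu"}
seen: set[str] = set()
first = accept(base, {"lr": 3e-4, "activation": "gelu"}, seen)   # one change, new
repeat = accept(base, {"activation": "gelu", "lr": 3e-4}, seen)  # same config again
```

Hashing a canonical form of the config means two proposals that differ only in key order are treated as the same experiment.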
What does a typical AutoLab research cycle look like?
The Planner proposes a first batch of single-change experiments based on the Researcher's hypotheses. Workers run them in parallel as HF Jobs, potentially for hours. The Reporter monitors progress via Trackio and alerts you if agents go off course. At the end of the cycle, query the Parquet data layer to rank experiments by your verifiable metric and promote the best patch to main. Then iterate: the Researcher scans new papers informed by current results, and the cycle continues.
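The end-of-cycle ranking might look like the sketch below. It assumes the metric rows have already been loaded into a list of dictionaries (for example with `pandas.read_parquet(path).to_dict("records")`) and that each row has `run`, `step`, and `val_bpb` fields. That schema is an assumption made for illustration, not Trackio's documented layout, and the values are invented.

```python
def rank_runs(rows: list[dict], metric: str = "val_bpb") -> list[tuple[str, float]]:
    """Rank runs by the metric logged at each run's final step (lower is better)."""
    final: dict[str, tuple[int, float]] = {}
    for row in rows:
        run, step, value = row["run"], row["step"], row[metric]
        # Keep only the value from the latest step seen for each run.
        if run not in final or step > final[run][0]:
            final[run] = (step, value)
    return sorted(((run, v) for run, (_, v) in final.items()), key=lambda rv: rv[1])


# Invented rows standing in for the contents of the Parquet files.
rows = [
    {"run": "baseline", "step": 100, "val_bpb": 1.10},
    {"run": "baseline", "step": 200, "val_bpb": 1.00},
    {"run": "exp/gelu", "step": 100, "val_bpb": 1.05},
    {"run": "exp/gelu", "step": 200, "val_bpb": 0.97},
    {"run": "exp/wd", "step": 200, "val_bpb": 1.02},
]
ranking = rank_runs(rows)
```

Ranking on the final step is one choice among several; taking the minimum over all steps or averaging the last few would be equally defensible, as long as the same rule is applied to every run.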
What's the next step?
Identify a verifiable metric for your current research question. Set up a Git repo with your training script on the main branch. Configure Trackio for metrics storage and define the four agent roles. Start with a small batch of 3-5 single-change experiments to validate the pipeline before scaling to overnight runs.
// FREQUENTLY ASKED QUESTIONS
Does AutoLab work for any ML research question?
AutoLab works best when experiments are verifiable (produce a measurable metric), independent (don't depend on each other's results), and implementable as single-change patches to a training script. Research questions requiring sequential, dependent experiments or qualitative evaluation are less suited to the AutoLab pattern.
How many experiments can AutoLab run in parallel?
The number of parallel experiments depends on your HF Jobs compute budget and the number of Worker agents you configure. Each Worker picks up one experiment at a time and submits it to HF Jobs, so in principle you can run as many in parallel as your compute quota allows. The Planner manages the queue to ensure Workers always have valid experiments to execute.
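The relationship between Worker count and parallelism can be illustrated with a bounded thread pool. In this sketch `submit` is a hypothetical callable that blocks until a job finishes and returns its metric; it stands in for whatever actually submits to HF Jobs, which is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_queue(
    experiments: list[str],
    submit: Callable[[str], float],
    num_workers: int,
) -> dict[str, float]:
    """Run queued experiments with at most num_workers jobs in flight at once."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in the same order as the input queue.
        metrics = list(pool.map(submit, experiments))
    return dict(zip(experiments, metrics))


def fake_submit(name: str) -> float:
    """Placeholder job: returns a dummy metric instead of training anything."""
    return float(len(name))


results = run_queue(["exp/a", "exp/bb", "exp/ccc"], fake_submit, num_workers=2)
```

With `num_workers=2`, the third experiment starts only once one of the first two has finished, which mirrors each Worker handling one job at a time.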
Can I use AutoLab with my own compute cluster instead of HF Jobs?
The framework is designed around HF Jobs for compute provisioning, but the core principles are platform-agnostic. You would need to replace HF Jobs with your cluster's job submission system while keeping the primitives open: the agent must be able to submit, monitor, and retrieve results from jobs programmatically, without opaque API layers.
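One way to keep those primitives open is to put a small interface between the agents and the compute backend. The sketch below is an assumption about how that could be structured, not part of AutoLab: `JobBackend` and `LocalBackend` are hypothetical names, and the local subprocess backend is only a stand-in for a real cluster scheduler.

```python
import subprocess
import sys
from typing import Protocol


class JobBackend(Protocol):
    """Hypothetical minimal interface an agent needs from any compute backend."""

    def submit(self, command: list[str]) -> str: ...
    def status(self, job_id: str) -> str: ...
    def result(self, job_id: str) -> str: ...


class LocalBackend:
    """Runs each job as a local subprocess; a stand-in for a cluster scheduler."""

    def __init__(self) -> None:
        self._procs: dict[str, subprocess.Popen] = {}
        self._results: dict[str, str] = {}

    def submit(self, command: list[str]) -> str:
        job_id = f"job-{len(self._procs)}"
        self._procs[job_id] = subprocess.Popen(
            command, stdout=subprocess.PIPE, text=True
        )
        return job_id

    def status(self, job_id: str) -> str:
        code = self._procs[job_id].poll()
        if code is None:
            return "running"
        return "succeeded" if code == 0 else "failed"

    def result(self, job_id: str) -> str:
        # Blocks until the process exits; output is cached for repeated calls.
        if job_id not in self._results:
            stdout, _ = self._procs[job_id].communicate()
            self._results[job_id] = stdout.strip()
        return self._results[job_id]


backend: JobBackend = LocalBackend()
job = backend.submit([sys.executable, "-c", "print('val_bpb=0.97')"])
output = backend.result(job)
```

A cluster-specific backend would implement the same three methods on top of its scheduler's own commands, leaving the agent code unchanged.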