How ML Researchers Run Autonomous Experiments with AutoLab

For applied ML researchers and research engineers · Based on Burtenshaw, AI Systems Engineering via Coding Agents

// TL;DR

Applied ML researchers can use Burtenshaw's AutoLab pattern (Boss 3) to search for training improvements autonomously by running parallel experiments overnight. Set up four agent roles (Researcher, Planner, Workers, Reporter) that scan papers for ideas, queue single-change experiments, implement them as training script patches, and track results via Trackio's open Parquet data layer. AutoLab works best when experiments produce verifiable metrics like bits-per-byte or validation loss. The result is a distributed research team that can explore the hypothesis space faster than a single researcher or a single-agent loop working sequentially.

Why should ML researchers use multi-agent systems for experiment iteration?

ML research involves testing many hypotheses — learning rate schedules, activation functions, architecture modifications, data augmentation strategies. Most researchers iterate sequentially: try one thing, wait for results, try the next. The AutoLab pattern parallelizes this process using four specialized agents that propose, implement, run, and evaluate experiments simultaneously.

The key requirement is verifiable experiments. If your training runs produce a measurable output (validation loss, bits-per-byte, accuracy, perplexity), then agents can objectively score and rank results without human judgment. That objective scoring is what lets the loop run unattended: no person has to decide which change won.
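As a minimal sketch of what "objectively score and rank" can mean in practice, the snippet below sorts runs by a single shared metric. The `RunResult` type, the `LOWER_IS_BETTER` set, and the numbers are all assumptions made for illustration; they are not part of AutoLab or Trackio.

```python
from dataclasses import dataclass

# Assumption: these metric names are lower-is-better;
# any other metric name is treated as higher-is-better.
LOWER_IS_BETTER = {"val_loss", "bits_per_byte", "perplexity"}


@dataclass(frozen=True)
class RunResult:
    name: str     # hypothetical experiment identifier
    metric: str   # name of the metric the run reports
    value: float  # final value of that metric


def rank_runs(results):
    """Return runs sorted best-first; all runs must report the same metric."""
    if not results:
        return []
    metrics = {r.metric for r in results}
    if len(metrics) != 1:
        raise ValueError(f"cannot rank across metrics: {sorted(metrics)}")
    higher_is_better = results[0].metric not in LOWER_IS_BETTER
    return sorted(results, key=lambda r: r.value, reverse=higher_is_better)


# Made-up numbers, for illustration only.
runs = [
    RunResult("baseline", "val_loss", 2.31),
    RunResult("cosine-lr", "val_loss", 2.24),
    RunResult("gelu-swap", "val_loss", 2.29),
]
print([r.name for r in rank_runs(runs)])
# prints ['cosine-lr', 'gelu-swap', 'baseline']
```

Refusing to rank across different metrics is a deliberate choice here: a validation loss and an accuracy are not comparable, so mixing them should fail loudly rather than produce a misleading order.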

How do you set up the four AutoLab agent roles?

Researcher: This agent scans recent papers via HF Papers CLI or arXiv and formulates improvement hypotheses. It outputs structured proposals: what to change, why it might work, and the expected metric to measure.
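One possible shape for the Researcher's structured proposal is sketched below. The `Proposal` class and its field names are hypothetical; the source only says a proposal states what to change, why it might work, and which metric to measure.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Proposal:
    change: str       # what to change in the training script
    rationale: str    # why the change might help
    metric: str       # the metric the experiment should be judged on
    source: str = ""  # optional paper reference (hypothetical field)

    def validate(self):
        """Reject proposals that leave a required field blank."""
        for field_name in ("change", "rationale", "metric"):
            if not getattr(self, field_name).strip():
                raise ValueError(f"proposal is missing '{field_name}'")

    def to_json(self):
        """Serialize for hand-off to the Planner."""
        self.validate()
        return json.dumps(asdict(self), sort_keys=True)


proposal = Proposal(
    change="Replace linear LR decay with cosine decay",
    rationale="Cosine schedules are a common alternative worth testing",
    metric="val_loss",
)
print(proposal.to_json())
```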

Planner: Receives hypotheses from the Researcher and maintains a structured queue of single-change experiments. It tracks current hyperparameters, past results, and deduplicates ideas. The Planner rejects stale or duplicate proposals before passing them to Workers.
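The Planner's queue-and-deduplicate behavior could look roughly like the sketch below. The `Planner` class, its method names, and the whitespace-and-case normalization rule are assumptions; a real Planner agent would likely need a smarter notion of "duplicate" than string matching.

```python
from collections import deque


def normalize(change):
    """Collapse case and whitespace so trivially different wordings match."""
    return " ".join(change.lower().split())


class Planner:
    def __init__(self, already_run=()):
        self.queue = deque()
        # Normalized changes that were already queued or run in past rounds.
        self.seen = {normalize(c) for c in already_run}

    def submit(self, change):
        """Queue a proposed change; return False if it is empty or a duplicate."""
        key = normalize(change)
        if not key or key in self.seen:
            return False
        self.seen.add(key)
        self.queue.append(change)
        return True

    def next_experiment(self):
        """Hand the oldest queued change to a Worker, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None


planner = Planner(already_run=["Swap ReLU for GELU"])
print(planner.submit("Use cosine LR decay"))    # prints True
print(planner.submit("use  cosine lr decay"))   # prints False (duplicate)
print(planner.submit("swap relu for gelu"))     # prints False (already run)
```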

Workers: Pick up queued experiments, implement each hypothesis as a patch to the training script (architecture change, parameter modification, data pipeline adjustment), and submit them as HF Jobs on the Hub. Workers run in parallel — potentially for hours — on Hub-provisioned GPUs.
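The Worker loop can be sketched as "apply a patch, then submit, in parallel". In the sketch below a patch is simplified to a config override rather than a real diff to a training script, and `submit_job` is a hypothetical stand-in that does not call the HF Jobs API; swap in your actual submission call there.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical baseline configuration of the training script.
BASE_CONFIG = {"lr_schedule": "linear", "activation": "relu", "dropout": 0.1}


def apply_patch(base, patch):
    """Return a patched copy of the config; unknown keys are rejected."""
    unknown = [k for k in patch if k not in base]
    if unknown:
        raise KeyError(f"patch touches unknown settings: {unknown}")
    return {**base, **patch}


def submit_job(config):
    """Stand-in for a real job submission (assumption: replace with your own call)."""
    return {"config": config, "status": "submitted"}


def run_workers(patches, max_workers=4):
    """Patch the baseline once per experiment and submit all jobs concurrently."""
    configs = [apply_patch(BASE_CONFIG, p) for p in patches]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(submit_job, configs))


jobs = run_workers([{"activation": "gelu"}, {"dropout": 0.0}])
print([j["config"] for j in jobs])
```

Threads are enough for this sketch because each worker would mostly wait on a remote job rather than compute locally.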

Reporter: Monitors all running jobs via Trackio, maintains the metrics dashboard, flags anomalies, and produces summary tables. Because Trackio stores data as Parquet files, the Reporter can query metrics directly without UI mediation.
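A simple anomaly check of the kind the Reporter might run is sketched below. It works on plain Python lists so that it makes no assumption about Trackio's Parquet schema; in practice the loss values would be loaded from those files. The `window` and `tolerance` thresholds are arbitrary illustrative choices.

```python
import math


def is_diverging(losses, window=3, tolerance=1.5):
    """Return True if the loss is non-finite or has drifted well above its earlier best.

    Assumes positive loss values.
    """
    if any(not math.isfinite(x) for x in losses):
        return True
    if len(losses) <= window:
        return False  # too few points to compare recent values against earlier ones
    best_earlier = min(losses[:-window])
    recent_mean = sum(losses[-window:]) / window
    return recent_mean > tolerance * best_earlier


def flag_runs(curves):
    """Take a mapping of run name to loss list; return the anomalous run names, sorted."""
    return sorted(name for name, losses in curves.items() if is_diverging(losses))


# Made-up curves, for illustration only.
curves = {
    "healthy": [3.0, 2.5, 2.2, 2.1, 2.0, 1.9],
    "blown-up": [3.0, 2.5, 2.2, 4.0, 6.0, 9.0],
    "nan-run": [2.0, float("nan")],
}
print(flag_runs(curves))
# prints ['blown-up', 'nan-run']
```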

Use a Git repo as shared state: the main branch holds the current best training script and a scores data structure. Each experiment runs on its own branch.
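The source does not specify what the scores data structure looks like. The sketch below shows one hypothetical layout (a JSON-serializable dict kept on main) together with a branch-naming helper; the `exp/` prefix and all field names are assumptions.

```python
import json
import re


def branch_name(change):
    """Derive a Git-friendly branch name from a change description."""
    slug = re.sub(r"[^a-z0-9]+", "-", change.lower()).strip("-")
    return f"exp/{slug}"


def record_result(scores, change, value):
    """Return a new scores dict with one more experiment recorded (input is not mutated)."""
    entry = {"branch": branch_name(change), "change": change, "value": value}
    return {**scores, "experiments": scores["experiments"] + [entry]}


# Hypothetical contents of a scores file on the main branch (made-up numbers).
scores = {
    "metric": "val_loss",
    "best": {"branch": "main", "value": 2.31},
    "experiments": [],
}
scores = record_result(scores, "Swap ReLU for GELU", 2.29)
print(json.dumps(scores, indent=2))
```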

What experiments work best with the AutoLab pattern?

AutoLab excels with experiments that are:

- Verifiable: They produce a measurable metric (validation loss, bits-per-byte, inference latency)

- Independent: Each experiment tests a single change, so results don't confound each other

- Parallelizable: Multiple experiments can run simultaneously on separate compute

- Implementable as patches: Changes can be expressed as modifications to a training script

Examples: swapping learning rate schedules, replacing activation functions, modifying attention mechanisms, adding or removing regularization, changing tokenization strategies.

Experiments requiring subjective evaluation ("does the output sound more natural?") are poor fits because agents cannot objectively score them.
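The "independent" criterion above can be checked mechanically if experiments are expressed as config changes. The sketch below, which assumes configs are flat dicts (a simplification of real training script patches), reports which settings differ from the baseline and whether exactly one does.

```python
_MISSING = object()  # sentinel so a key set to None differs from an absent key


def changed_keys(baseline, candidate):
    """Return the sorted names of settings that differ between two configs."""
    keys = baseline.keys() | candidate.keys()
    return sorted(
        k for k in keys
        if baseline.get(k, _MISSING) != candidate.get(k, _MISSING)
    )


def is_single_change(baseline, candidate):
    """True when the candidate differs from the baseline in exactly one setting."""
    return len(changed_keys(baseline, candidate)) == 1


baseline = {"lr_schedule": "linear", "activation": "relu", "dropout": 0.1}
one_change = {**baseline, "activation": "gelu"}
two_changes = {**baseline, "activation": "gelu", "dropout": 0.0}

print(is_single_change(baseline, one_change))   # prints True
print(changed_keys(baseline, two_changes))      # prints ['activation', 'dropout']
```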

How do you monitor and interpret AutoLab results?

Trackio is the metrics layer. Its critical property is the open Parquet data layer — agents query it directly, build custom visualizations, and generate Gantt charts showing agent timelines and experiment scores across the run.

Use HF Jobs labels to tag, sort, and filter runs programmatically. The Reporter agent surfaces events and anomalies: training curves that diverge, jobs that fail, or experiments that exceed baseline metrics.
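Filtering and sorting by labels can be illustrated on plain records. The record layout below (a `labels` dict plus a `val_loss` field) is an assumption made for the sketch and is not the HF Jobs response format.

```python
def filter_runs(runs, **labels):
    """Keep runs whose labels match every given key/value pair."""
    return [
        run for run in runs
        if all(run["labels"].get(k) == v for k, v in labels.items())
    ]


# Hypothetical run records with made-up values.
runs = [
    {"name": "exp-1", "labels": {"round": "1", "kind": "lr"}, "val_loss": 2.24},
    {"name": "exp-2", "labels": {"round": "1", "kind": "activation"}, "val_loss": 2.29},
    {"name": "exp-3", "labels": {"round": "2", "kind": "lr"}, "val_loss": 2.20},
]

round_one = sorted(filter_runs(runs, round="1"), key=lambda r: r["val_loss"])
print([r["name"] for r in round_one])
# prints ['exp-1', 'exp-2']
```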

At the end of a run, query the Parquet data layer to rank all experiments by the target metric. Promote the best-performing patch to the main branch. Iterate by feeding results back to the Researcher for the next round of hypotheses.
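The promotion step can be sketched as a small gate: pick the best branch, but only promote it if it beats the current baseline by some margin. The function name, the `min_gain` parameter, and the lower-is-better assumption are illustrative choices, not something the source prescribes.

```python
def pick_promotion(baseline, results, min_gain=0.0):
    """Return the branch to promote, or None if nothing beats the baseline.

    Assumes a lower-is-better metric; `results` maps branch name to metric value.
    """
    if not results:
        return None
    best_branch = min(results, key=results.get)
    if baseline - results[best_branch] > min_gain:
        return best_branch
    return None


# Made-up numbers, for illustration only.
results = {"exp/cosine-lr": 2.24, "exp/gelu-swap": 2.29}
print(pick_promotion(2.31, results))                # prints exp/cosine-lr
print(pick_promotion(2.31, results, min_gain=0.1))  # prints None
```

A nonzero `min_gain` is one way to avoid promoting changes whose improvement is within run-to-run noise; choosing its value requires knowing how noisy your metric is.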

Next step: Identify a model and training setup with a clear verifiable metric. Define your four agent roles and configure Trackio with Parquet storage. Run a first batch of 5-10 single-change experiments overnight and review results in the morning.

// FREQUENTLY ASKED QUESTIONS

How many experiments can AutoLab run in parallel?

The number depends on your compute budget and HF Jobs availability. Each Worker agent submits experiments as Hub jobs on provisioned GPUs. You can scale Worker agents to match available compute. Store training scripts in an HF bucket to avoid redundant uploads. Use the Planner to batch experiments and the Reporter to monitor all concurrent runs via Trackio's Parquet layer.
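Matching Workers to available compute amounts to batching the queue. The sketch below assumes one GPU per experiment, which may not hold for larger models.

```python
def batches(experiments, gpus_available):
    """Split the queue into batches that fit the available GPUs (one GPU each)."""
    if gpus_available < 1:
        raise ValueError("need at least one GPU")
    return [
        experiments[i:i + gpus_available]
        for i in range(0, len(experiments), gpus_available)
    ]


queue = ["cosine-lr", "gelu-swap", "no-dropout", "rope", "label-smoothing"]
print(batches(queue, gpus_available=2))
# prints [['cosine-lr', 'gelu-swap'], ['no-dropout', 'rope'], ['label-smoothing']]
```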

Can AutoLab discover genuinely novel training improvements?

AutoLab's Researcher agent scans recent papers for ideas and proposes hypotheses, so its ideas are bounded by what has been published. The pattern excels at systematically exploring known improvement directions: hyperparameter tuning, architecture ablations, and established techniques applied to new models. Genuinely novel ideas still require human insight; what AutoLab speeds up is the testing and validation loop.

What happens if an AutoLab experiment goes off the rails?

The Reporter agent monitors all running jobs and flags anomalies — diverging loss curves, failed jobs, or results far outside expected ranges. Trackio's events and notification layer can alert you in real time. Because each experiment runs on its own Git branch, a failed or harmful experiment is isolated and cannot corrupt the main training script.