How Should AI Researchers Design Models That Compound With Products?
For AI researchers and ML engineers · Based on Emit Jane Luma Foundation Lab Method
// TL;DR
AI researchers in a foundation lab must design models that are jointly optimized with products. This means training on process data captured from real product deployments, building unified single-tower architectures instead of separate modality towers, and applying the 10x logarithmic scaling test before every major training investment. Every model improvement should directly improve the product, and every product deployment should generate the scarce training data (especially process data) that the internet cannot supply. Research that doesn't feed the product loop is misaligned with foundation lab principles.
How should AI researchers think about model architecture in a foundation lab?
Build a unified single-tower model, not separate towers per modality. This is the Foundation Lab Method's core architectural principle.
A unified model (one backbone fusing language, image, video, and audio tokens into a single jointly trained system) enables capabilities that are categorically impossible with separate modality towers. Understanding a character's identity across a long film production, reasoning about visual states referenced in code, comprehending physical causality through combined language and video signals: none of this emerges from stitching together separate models.
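To make the single-tower idea concrete, here is a minimal Python sketch of the input side only, assuming each modality has already been tokenized into discrete ids. The vocabulary sizes, the `build_offsets` and `fuse` helpers, and the offset scheme are illustrative assumptions, not the Foundation Lab Method's actual implementation. The point is only that every modality lands in one sequence with one shared id space, so a single backbone could attend across all of them.

```python
# Hypothetical per-modality vocabulary sizes (assumed for illustration).
VOCAB_SIZES = {"language": 1000, "video": 500, "audio": 200}


def build_offsets(vocab_sizes):
    """Assign each modality a disjoint id range inside one shared vocabulary."""
    offsets, total = {}, 0
    for modality, size in vocab_sizes.items():
        offsets[modality] = total
        total += size
    return offsets, total


def fuse(segments, offsets, vocab_sizes):
    """Flatten (modality, token_ids) segments into one sequence of shared ids."""
    fused = []
    for modality, token_ids in segments:
        size = vocab_sizes[modality]
        for token_id in token_ids:
            if not 0 <= token_id < size:
                raise ValueError(f"{modality} token {token_id} out of range")
            fused.append(offsets[modality] + token_id)
    return fused


OFFSETS, SHARED_VOCAB = build_offsets(VOCAB_SIZES)

# Segments keep their original order in the fused stream.
sequence = fuse(
    [("language", [1, 2]), ("video", [0, 3]), ("audio", [5])],
    OFFSETS,
    VOCAB_SIZES,
)
```

With separate towers, each of these segments would go to a different model. Here they form one input that a single jointly trained backbone would consume.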
The priority order for modality fusion: language + video + audio covers approximately 90% of the path to a world model. Start with the highest-leverage fusion (language + image or language + video) and expand. At each fusion step, test whether the combination enables categorically new capabilities—not just incremental improvements.
This is admittedly hard. Unified models are described as 'ridiculously hard to train.' But it is the only path to end-to-end optimization and the only architecture that can become a world model.
How do I decide when to scale vs. when to fix architecture or data?
Apply the 10x logarithmic scaling test before every major training investment. Ask: if the next model were 10x larger in compute and parameters, would it be a categorically different thing—not just incrementally better?
If the answer isn't an obvious yes, the bottleneck is not scale. Diagnose the real constraint:
- Insufficient modality coverage: Is the model missing a modality (audio, video, or language) that should be fused into the backbone?
- Data quality or process data gaps: Is the training data only artifacts (finished outputs) without the process that created them?
- Architectural limitations: Are separate modality towers preventing joint optimization?
Fix the actual constraint before committing compute budget. Scaling alone cannot solve architectural or data quality problems—it just makes incrementally better versions of a fundamentally limited system.
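The checklist above can be written down as a small pre-training gate. This is a sketch under stated assumptions: the `ScalingCheck` fields and the `diagnose` function are hypothetical names, and the answers still come from human judgment. The code only encodes the order of the questions.

```python
from dataclasses import dataclass, field


@dataclass
class ScalingCheck:
    """Answers a team records before a major training run (hypothetical schema)."""
    categorically_different_at_10x: bool
    missing_modalities: list = field(default_factory=list)
    has_process_data: bool = True
    uses_separate_towers: bool = False


def diagnose(check):
    """Return the constraints to fix before committing compute."""
    if check.categorically_different_at_10x:
        # An obvious yes: scale is the bottleneck, so scaling is justified.
        return ["scale"]
    constraints = []
    if check.missing_modalities:
        constraints.append("modality coverage")
    if not check.has_process_data:
        constraints.append("process data")
    if check.uses_separate_towers:
        constraints.append("architecture")
    # None of the three listed constraints matched: keep investigating.
    return constraints or ["undiagnosed"]


result = diagnose(
    ScalingCheck(
        categorically_different_at_10x=False,
        missing_modalities=["audio"],
        has_process_data=False,
        uses_separate_towers=True,
    )
)
```

In this example `result` lists all three constraints, which signals that a larger run would only produce an incrementally better version of a limited system.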
What kind of training data should AI researchers prioritize?
Process data over artifact data, always.
The internet gives you artifacts—finished movies, images, code repositories. It does not give you how those artifacts were made: the actions, iterations, decisions, undos, and refinements that led to the final output. End-to-end agents that do real work for real professions require this process data.
In a foundation lab, this data comes from the product. Every interaction in your deployed product is a training signal. Design the product logging infrastructure so that the full path to every artifact—not just the final output—is captured and usable for training. Forward Deployed Creatives in enterprise deployments are especially valuable: they observe the best practitioners using the system and pipe that intelligence directly back to research.
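As an illustration of logging the full path rather than only the final output, the following Python sketch records every step of a session, including undos, and serializes it as JSON Lines. The `ProcessEvent` schema, the action names, and the `ProcessLog` class are assumptions made for this example; a real product would define its own event taxonomy.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ProcessEvent:
    session_id: str
    step: int      # order of the action within the session
    action: str    # e.g. "prompt", "edit", "undo", "export" (hypothetical names)
    payload: dict  # action-specific details


class ProcessLog:
    """Append-only record of everything a user did on the way to an artifact."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.events = []

    def record(self, action, **payload):
        event = ProcessEvent(self.session_id, len(self.events), action, payload)
        self.events.append(event)
        return event

    def to_jsonl(self):
        """One JSON object per line, ready to hand to a training data pipeline."""
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self.events)

    def final_artifact(self):
        """The last export: the only part an artifact-only dataset would keep."""
        exports = [e for e in self.events if e.action == "export"]
        return exports[-1] if exports else None


log = ProcessLog("session-001")
log.record("prompt", text="storm over a harbor")
log.record("edit", target="lighting", value="dusk")
log.record("undo", reverted_step=1)
log.record("export", format="mp4")
```

The session holds four events, but `final_artifact()` returns only the last one. The three preceding steps are the process data that a scrape of finished outputs would never contain.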
If there's no internet-scale dataset for your modality at all, the first product must be a data generation engine—something people love to use for free that produces training data at scale.
How do researchers and product teams work together in a foundation lab?
They don't 'work together'—they are one team. This is the most radical organizational claim of the Foundation Lab Method.
In practice, this means researchers don't optimize benchmarks in isolation. Every training run is designed to improve specific product capabilities. Every model gap identified by the product is translated into a data collection task for the next training run—typically achievable in two to three weeks, not months. Researchers review product telemetry daily. Product decisions are made by people who understand model training constraints.
The shared metric is compound loop velocity: how fast the cycle of product usage → training data → better model → better product → more usage turns over. Research that doesn't feed this loop is misaligned, no matter how novel the technique.
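The source does not give a formula for loop velocity, so the following is only one possible way to operationalize it: treat the loop as a chain of stages, sum their durations to get a cycle time, and report cycles per quarter. The stage names, the durations, and the 90-day quarter are all assumptions for illustration.

```python
def cycle_time_days(stage_days):
    """Total days for one pass through the loop."""
    return sum(stage_days.values())


def cycles_per_quarter(stage_days, quarter_days=90):
    """How many full loops fit in a quarter at the current pace."""
    total = cycle_time_days(stage_days)
    if total <= 0:
        raise ValueError("stage durations must sum to a positive number")
    return quarter_days / total


def slowest_stage(stage_days):
    """The stage to shorten first if the goal is a faster loop."""
    return max(stage_days, key=stage_days.get)


# Hypothetical durations for one product capability.
stages = {
    "usage_to_data": 14,     # product gap identified -> data collected
    "data_to_model": 7,      # training run
    "model_to_product": 7,   # evaluation and release
}

total_days = cycle_time_days(stages)
velocity = cycles_per_quarter(stages)
bottleneck = slowest_stage(stages)
```

Tracking the slowest stage alongside the headline number keeps the metric actionable: it points at where the loop is stalling, not just how fast it runs.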
Start by auditing your current research priorities against the product roadmap. Identify every training run that doesn't directly improve a product capability, every capability gap being solved with engineering harnesses instead of training data, and every product interaction whose process data isn't being captured for training.
// FREQUENTLY ASKED QUESTIONS
Why should AI researchers build unified models instead of separate modality models?
Separate modality towers cannot jointly optimize, preventing the model from developing true physical world understanding. A unified single tower—one backbone processing language, audio, video, and images as one signal stream—enables categorically new capabilities impossible with separate models. It's the only architecture that can become a world model. Language + video + audio covers about 90% of the path. Start with the highest-leverage fusion and test whether each combination enables categorically new things.
How do I know if my AI model's limitations are a scaling problem or something else?
Apply the 10x logarithmic scaling test: would a 10x increase in compute and parameters make the model categorically different, not just incrementally better? If the answer isn't obviously yes, the constraint is architectural (separate modality towers), data quality (artifacts without process data), or missing modality coverage. Fix the real bottleneck before investing in scale. Scaling solves scaling problems; it doesn't fix architectural limitations or missing training signal types.
What is process data and how do I collect it for model training?
Process data captures how an artifact was made—actions, iterations, decisions, and refinements—not just the finished output. Collect it by deploying your product to real users and logging every interaction path. Forward Deployed Creatives in enterprise settings observe expert practitioners and pipe intelligence back to research. Design your product's logging infrastructure to capture the full creation journey. The internet supplies finished artifacts; only your deployed product can supply the process data that end-to-end agents need.