Scaling Laws for Pre-training Agents and World Models
Tim Pearce, Tabish Rashid, Dave Bignell, Raluca Georgescu, Sam Devlin, Katja Hofmann
TL;DR
This work investigates scaling laws for embodied AI, treating pre-training loss as a proxy for downstream performance in world modeling (WM) and behavior cloning (BC). Models are GPT-2-style transformers trained on offline data with two input schemes: observations tokenized by a VQGAN, and continuous embeddings produced by a CNN. The work demonstrates that power-law relationships between compute, model size, and data exist in these domains, with coefficients that depend on tokenizer compression, architecture, and task. World modeling scaling closely mirrors LLM scaling under certain tokenization settings, whereas BC scaling is highly sensitive to input representation and can shift toward either more data or more compute depending on the architecture. The study also shows that the scaling laws remain valid under extrapolation, and it uses controlled ablations to give mechanistic insight into why different BC and WM setups scale differently, offering practical guidance for compute-efficient design in embodied AI systems.
Abstract
The performance of embodied agents has been shown to improve with increases in model parameters, dataset size, and compute. This has been demonstrated in domains from robotics to video games, when generative learning objectives on offline datasets (pre-training) are used to model an agent's behavior (imitation learning) or its environment (world modeling). This paper characterizes the role of scale in these tasks more precisely. Going beyond the simple intuition that 'bigger is better', we show that the same types of power laws found in language modeling also arise in world modeling and imitation learning (e.g. between loss and optimal model size). However, the coefficients of these laws are heavily influenced by the tokenizer, task, and architecture, which has important implications for the optimal sizing of models and data.
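To make the notion of a power law concrete, the following is a minimal sketch of how a relationship of the form y = a * x^b might be fitted by linear regression in log-log space. The function name `fit_power_law`, the choice of compute as x and optimal model size as y, and all numbers are illustrative assumptions; the data are synthetic and are not results from the paper, and the paper's own fitting procedure may differ.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = a * x**b by least squares in log-log space; return (a, b)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # A power law is a straight line in log-log coordinates:
    #   log y = log a + b * log x
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    b, log_a = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(log_a)), float(b)


# Synthetic, noise-free data for illustration only (hypothetical values).
compute = np.array([1e15, 1e16, 1e17, 1e18])   # training compute (FLOPs)
true_a, true_b = 0.1, 0.5                      # assumed coefficient and exponent
n_opt = true_a * compute ** true_b             # "optimal" model size at each budget

a, b = fit_power_law(compute, n_opt)

# Extrapolate the fitted law to a compute budget outside the fitted range.
predicted = a * 1e19 ** b
```

The fitted exponent `b` is the quantity that, according to the abstract, varies with tokenizer, task, and architecture; a larger exponent for model size would mean that more of each additional unit of compute should go to parameters rather than data.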
