Scaling Laws for Pre-training Agents and World Models

Tim Pearce, Tabish Rashid, Dave Bignell, Raluca Georgescu, Sam Devlin, Katja Hofmann

TL;DR

This work investigates scaling laws for embodied AI by treating pre-training loss as a proxy for downstream performance in world modeling (WM) and behavior cloning (BC), using offline data and GPT-2–style transformers with two input schemes (tokenized observations via VQGAN and CNN-based continuous embeddings). It demonstrates that power-law relationships between compute, model size, and data exist in these domains, with coefficients that depend on tokenizer compression, architecture, and task. The results reveal that world modeling scaling closely mirrors LLM scaling under certain tokenization settings, while BC scaling is highly sensitive to input representation and can shift toward either more data or more compute depending on the architecture. The study also shows that the scaling laws extrapolate to larger scales and provides mechanistic insights (via controlled ablations) into why different BC and WM setups exhibit distinct scaling behaviors, offering practical guidance for compute-efficient design in embodied AI systems.

Abstract

The performance of embodied agents has been shown to improve by increasing model parameters, dataset size, and compute. This has been demonstrated in domains from robotics to video games, when generative learning objectives on offline datasets (pre-training) are used to model an agent's behavior (imitation learning) or their environment (world modeling). This paper characterizes the role of scale in these tasks more precisely. Going beyond the simple intuition that 'bigger is better', we show that the same types of power laws found in language modeling also arise in world modeling and imitation learning (e.g. between loss and optimal model size). However, the coefficients of these laws are heavily influenced by the tokenizer, task, and architecture, which has important implications for the optimal sizing of models and data.
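
To make the notion of a power-law coefficient concrete, the following is a minimal sketch of how such a law could be fitted, assuming the common form N_opt = a * C^b relating compute budget C (FLOPs) to optimal model size N_opt (parameters). The function name `fit_power_law`, the symbols `a` and `b`, and all numbers are illustrative assumptions; the data are synthetic and are not results from the paper, and this is not necessarily the authors' fitting procedure.

```python
import numpy as np


def fit_power_law(compute, n_opt):
    """Fit n_opt = a * compute**b by least squares in log-log space.

    A power law is a straight line after taking logs:
    log(n_opt) = log(a) + b * log(compute).
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(n_opt), deg=1)
    return float(np.exp(intercept)), float(slope)


# Synthetic, noise-free points generated from an assumed law
# (hypothetical values, NOT taken from the paper).
compute = np.array([1e15, 1e16, 1e17, 1e18])  # FLOPs budgets
true_a, true_b = 0.1, 0.5
n_opt = true_a * compute ** true_b  # optimal parameter counts

a, b = fit_power_law(compute, n_opt)

# Extrapolate the fitted law to a larger, unseen compute budget.
predicted = a * 1e19 ** b
print(f"a={a:.4g}, b={b:.4g}, predicted N_opt at 1e19 FLOPs={predicted:.4g}")
```

In this form, the exponent `b` is the coefficient that determines how a growing compute budget should be split: a larger `b` favors bigger models, while a smaller `b` favors training on more data.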

Paper Structure

This paper contains 29 sections, 7 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: This paper observes that scaling laws, as originally found in LLMs, also emerge in the tasks of world modeling and BC, when studying the pre-training loss on large datasets of human behavior. (a, b) For world modeling, the power law coefficient determining optimal model size is affected by the compression rate of the tokenizer. (c) In BC with tokenized image observations (BC-Token), small models need a large FLOPs budget to saturate, making these scaling laws less clear-cut. (d) However, moving to a single continuous embedding per observation remedies this (BC-CNN), producing prototypical scaling laws and a more balanced optimal model size coefficient.
  • Figure 2: Our meta-analysis of Tuyls et al. (2023) provides evidence that pre-training loss is strongly correlated with reward in BC tasks in the infinite data regime.
  • Figure 3: Our experiments suggest pre-training loss is a good proxy for world model quality. Further details are given in the appendix.
  • Figure 4: The world modeling (WM) and behavior cloning (BC) task and architecture combinations considered in this work. The fire symbol signifies trainable components; the ice symbol signifies frozen pre-trained components.
  • Figure 5: Example trajectories from a dataset of 8.6 years of human gameplay in the video game Bleeding Edge across 7 maps.
  • ...and 11 more figures