Scaling Laws for Imitation Learning in Single-Agent Games

Jens Tuyls, Dhruv Madeka, Kari Torkkola, Dean Foster, Karthik Narasimhan, Sham Kakade

TL;DR

The paper shows that imitation learning in single-agent games scales predictably with compute, revealing power-law relationships between loss, returns, model size, and data size across Atari and the challenging NetHack. By deriving isoFLOP and parametric scaling laws, it links loss to performance and forecasts compute-optimal configurations that outperform prior NetHack baselines by substantial margins. The results extend to reinforcement learning settings and emphasize how partial observability and information parity influence scaling, with practical implications for data and compute planning. Overall, carefully scaling model and data size can yield large, predictable gains in BC performance, suggesting broad applicability of scaling laws in IL/RL.

Abstract

Imitation Learning (IL) is one of the most widely used methods in machine learning. Yet, many works find it is often unable to fully recover the underlying expert behavior, even in constrained environments like single-agent games. However, none of these works deeply investigate the role of scaling up the model and data size. Inspired by recent work in Natural Language Processing (NLP) where "scaling up" has resulted in increasingly more capable LLMs, we investigate whether carefully scaling up model and data size can bring similar improvements in the imitation learning setting for single-agent games. We first demonstrate our findings on a variety of Atari games, and thereafter focus on the extremely challenging game of NetHack. In all games, we find that IL loss and mean return scale smoothly with the compute budget (FLOPs) and are strongly correlated, resulting in power laws for training compute-optimal IL agents. Finally, we forecast and train several NetHack agents with IL and find they outperform prior state-of-the-art by 1.5x in all settings. Our work both demonstrates the scaling behavior of imitation learning in a variety of single-agent games, as well as the viability of scaling up current approaches for increasingly capable agents in NetHack, a game that remains elusively hard for current AI systems.

Paper Structure

This paper contains 54 sections, 18 equations, 17 figures, and 6 tables.

Figures (17)

  • Figure 1: BC loss scaling. We train a wide range of model sizes across several orders of magnitude of FLOP budgets. We plot the validation loss for each model, with a parabola fitted to each isoFLOP curve (a). We then regress the loss minima (b), the loss-optimal number of parameters (c), and the loss-optimal number of samples (d) on their corresponding FLOP budgets. We find clear power-law trends for NetHack (left column), Battle Zone (middle column), and Breakout (right column). The full list of Atari results can be found in the appendix on full Atari results.
  • Figure 2: BC return scaling. We train a wide range of model sizes across several orders of magnitude of FLOP budgets (same models as in Figure 1) and plot their average return in the environment (a). We then regress the optimal returns (b), the return-optimal number of parameters (c), and the return-optimal number of samples (d) on their corresponding FLOP budgets. We find mostly clear power-law trends for NetHack (left), Battle Zone (middle), and Breakout (right). Full Atari results can be found in the appendix on full Atari results.
  • Figure 2: Comparison with baselines. We compare our best BC model with previous models in the NetHackChallenge-v0 environment and find it outperforms all of them on the human monk role in the offline setting. $^*$Exact scores not reported. See the appendix on full forecasting results for full results with standard errors.
  • Figure 3: BC return vs. optimal loss. We investigate the relationship between the optimal loss of a BC agent and the mean return. We find they are highly correlated for all games.
  • Figure 4: RL return scaling. We train a wide range of model sizes across several orders of magnitude of FLOP budgets and plot the average return when rolled out in the environment at the end of training (a). We then regress the return-optimal average returns (b), parameters (c), and samples (d) on their corresponding FLOP budgets. We train 1 seed per point on the isoFLOP profile.
  • ...and 12 more figures
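
The isoFLOP procedure described in the Figure 1 caption (fit a parabola per FLOP budget, take its minimum, then regress the minima on the budgets) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `isoflop_minimum` and `fit_power_law` are hypothetical, fitting the parabola in log10 of the parameter count is an assumption, and all numbers are synthetic values invented for the example rather than results from the paper.

```python
import numpy as np


def isoflop_minimum(params, losses):
    """Fit a parabola to loss vs. log10(params) for one FLOP budget.

    Returns the parameter count at the parabola's minimum and the loss there.
    Fitting in log10(params) is an assumption made for this sketch.
    """
    x = np.log10(np.asarray(params, dtype=float))
    a, b, c = np.polyfit(x, np.asarray(losses, dtype=float), 2)
    if a <= 0:
        raise ValueError("parabola is not convex; no interior minimum")
    x_min = -b / (2 * a)
    return 10 ** x_min, c - b * b / (4 * a)


def fit_power_law(flops, values):
    """Least-squares fit of values = k * flops**alpha in log-log space."""
    alpha, log_k = np.polyfit(np.log10(flops), np.log10(values), 1)
    return 10 ** log_k, alpha


# Synthetic isoFLOP profiles (made-up numbers, NOT data from the paper).
# The loss-optimal model size is constructed to follow N_opt = 0.1 * C**0.5.
budgets = np.array([1e13, 1e14, 1e15, 1e16])
true_k, true_alpha = 0.1, 0.5

optimal_params = []
for budget in budgets:
    n_star = true_k * budget ** true_alpha
    sizes = n_star * np.logspace(-1, 1, 9)  # model sizes around the optimum
    losses = 1.0 + 0.2 * (np.log10(sizes) - np.log10(n_star)) ** 2
    n_opt, _ = isoflop_minimum(sizes, losses)
    optimal_params.append(n_opt)

# Regress the per-budget optima on the FLOP budgets to get the power law.
k, alpha = fit_power_law(budgets, optimal_params)
print(f"N_opt ~ {k:.3g} * C^{alpha:.3f}")
```

The same `fit_power_law` helper could be applied to the loss minima or to the loss-optimal number of samples, which correspond to panels (b) and (d) in the caption. With real training runs the fitted exponents would carry noise, so the clean recovery seen here only reflects how the synthetic data was constructed.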