AlphaZero Neural Scaling and Zipf's Law: a Tale of Board Games and Power Laws

Oren Neumann, Claudius Gros

TL;DR

This paper examines power-law scaling in AlphaZero, a reinforcement learning algorithm, using a model of language-model scaling and finds that game states in training and inference data scale with Zipf's law, which is known to arise from the tree structure of the environment.

Abstract

Neural scaling laws are observed in a range of domains, to date with no universal understanding of why they occur. Recent theories suggest that loss power laws arise from Zipf's law, a power law observed in domains like natural language. One theory suggests that language scaling laws emerge when Zipf-distributed task quanta are learned in descending order of frequency. In this paper we examine power-law scaling in AlphaZero, a reinforcement learning algorithm, using a model of language-model scaling. We find that game states in training and inference data scale with Zipf's law, which is known to arise from the tree structure of the environment, and examine the correlation between scaling-law and Zipf's-law exponents. In agreement with the quanta scaling model, we find that agents optimize state loss in descending order of frequency, even though this order scales inversely with modelling complexity. We also find that inverse scaling, the failure of models to improve with size, is correlated with unusual Zipf curves where end-game states are among the most frequent states. We show evidence that larger models shift their focus to these less-important states, sacrificing their understanding of important early-game states.

Paper Structure

This paper contains 47 sections, 23 equations, 16 figures.

Figures (16)

  • Figure 1: Zipf's law in AlphaZero games. Board-state frequency follows a power law in state rank, here for Connect Four and Pentago. Similar exponents $\alpha$ appear for agents of various sizes trained on different games, see appendix \ref{appendix:zipf_curves}.
  • Figure 2: Zipf power laws emerging from the tree structure of board games. A: In a toy-model game where board states branch out symmetrically, state frequency is organized as a series of plateaus centered around a straight line. B: Adding game rules, but still playing random moves, the plateaus are smoothed out, producing a power law with a slightly different exponent. This is particularly evident for Connect Four, which has a much lower branching factor than Pentago. The tail plateaus are caused by the finite amount of data.
  • Figure 3: Measuring correlation between Zipf's law and scaling exponents. Changing the move-selection temperature $T$ at inference time alters both the Zipf's law of the state distribution and the size-scaling law, plotted here for Connect Four. A: As $T \to 0$, the Zipf curve starts to bend, following a steeper power law at high ranks. B: $T$ also changes the Connect Four size scaling law, either directly by lowering policy quality at high $T$, or indirectly by changing the game state distribution. C: By modulating $T$ at low values, we plot the dependence of the scaling power law on Zipf's law, using the tail exponent.
  • Figure 4: Connect Four value-loss scaling with rank. A: Average loss on each agent's own training data. Loss steadily increases with rank. Larger agents achieve better loss. B: Loss on a dataset annotated by a game solver. Larger models are closer to the ground truth values of optimal play. The overall trend of loss increasing with rank is surprising, given that the complexity of states decreases with rank. See appendix \ref{appendix:ground_truth_loss} for details on error bars. C: The time needed to evaluate the same states using alpha-beta pruning connect4solver drops on an exponential scale with rank (mean and standard deviation are geometric).
  • Figure 5: Inverse scaling in AlphaZero. A: Agents playing Checkers and Oware follow a size scaling law that abruptly changes direction when large models fail to utilize their capacity. The scaling curve does not flip because it approaches the perfect-play ceiling: a suite of Oware agents trained with different hyperparameters (light green) extends to higher Elo scores. Error bars are one standard deviation. B: Oware games played by AlphaZero follow Zipf's law, interrupted by a small plateau. This is caused by high-frequency late-game states, present due to the unusual tree structure of Oware, see Fig. \ref{fig:turns}. C: Checkers shares the same tree structure, but the effect on the Zipf curve is less visible.
  • ...and 11 more figures
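
The toy model described in the caption of Figure 2A can be illustrated with a short sketch. It is not the authors' code: the function names, the branching factor of 3, the depth of 6, and the choice to fit the exponent at log-spaced ranks are all assumptions made for illustration. The sketch assumes a perfectly symmetric tree under uniformly random play, so each of the $b^d$ states at depth $d$ is reached with probability $b^{-d}$. Sorting states by frequency then gives plateaus that follow a power law in rank with an exponent near 1.

```python
import math


def toy_tree_frequencies(branching: int, depth: int) -> list[float]:
    """Exact visit probability of every state in a symmetric game tree.

    Hypothetical toy model: under uniformly random play, each of the
    branching**d states at depth d is reached with probability branching**-d.
    """
    freqs = []
    for d in range(1, depth + 1):
        freqs.extend([branching ** -d] * branching ** d)
    return freqs  # deeper states are rarer, so this is already rank-ordered


def fit_zipf_exponent(freqs: list[float]) -> float:
    """Estimate alpha in freq ~ rank**-alpha by least squares in log-log space.

    Ranks are sampled at powers of two (an illustrative choice) so that the
    long final plateau does not dominate the fit.
    """
    ranks = []
    r = 1
    while r <= len(freqs):
        ranks.append(r)
        r *= 2
    xs = [math.log(rank) for rank in ranks]
    ys = [math.log(freqs[rank - 1]) for rank in ranks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


freqs = toy_tree_frequencies(branching=3, depth=6)
alpha = fit_zipf_exponent(freqs)
print(f"{len(freqs)} states, fitted Zipf exponent alpha = {alpha:.2f}")
```

The fitted value lands somewhat below 1 because the plateaus are steps rather than a smooth line. Replacing the exact probabilities with counts from sampled games, or adding game rules that prune branches, would smooth the plateaus in the way the caption describes for panel B.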