A Neural Scaling Law from Lottery Ticket Ensembling
Ziming Liu, Max Tegmark
TL;DR
The paper investigates why model performance improves with size in ways that existing neural scaling law (NSL) theories do not predict, and introduces lottery ticket ensembling as a new mechanism. Using a minimal two-layer setup to fit $y=x^2$, it provides empirical evidence that wider networks contain more lottery tickets whose ensembling yields a central-limit-type $N^{-1}$ scaling, supported by symmetric-neuron observations and a formal variance-reduction argument. The key contributions are the identification of lottery tickets in wide nets, a central-limit-theorem-style justification for $N^{-1}$ scaling, and an analysis of ensembling versus synergy with implications for large models and physics-inspired learning theories. The work links ticket-level structure inside a network to its macroscopic scaling behavior and suggests avenues for pruning, distillation, and theory development in deep learning.
Abstract
Neural scaling laws (NSL) refer to the phenomenon where model performance improves with scale. Sharma & Kaplan analyzed NSL using approximation theory and predicted that MSE losses decay as $N^{-\alpha}$ with $\alpha=4/d$, where $N$ is the number of model parameters and $d$ is the intrinsic input dimension. Although their theory works well for some cases (e.g., ReLU networks), we find, surprisingly, that a simple 1D problem $y=x^2$ manifests a different scaling law ($\alpha=1$) from their prediction ($\alpha=4$). We opened the neural networks and found that the new scaling law originates from lottery ticket ensembling: a wider network on average has more "lottery tickets", which are ensembled to reduce the variance of outputs. We support the ensembling mechanism by mechanistically interpreting single neural networks, as well as studying them statistically. We attribute the $N^{-1}$ scaling law to the "central limit theorem" of lottery tickets. Finally, we discuss its potential implications for large language models and statistical physics-type theories of learning.
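The variance-reduction idea behind the $N^{-1}$ law can be illustrated with a toy simulation. The sketch below is not the paper's experiment and trains no network. It assumes each "lottery ticket" is an unbiased predictor of $y=x^2$ with independent Gaussian error, and that the ensemble output is the plain average of the tickets. The function name `ensemble_mse` and the parameters `n_tickets`, `n_trials`, and `noise_std` are hypothetical, chosen only for this illustration.

```python
import random


def ensemble_mse(n_tickets, n_trials=2000, noise_std=0.1, seed=0):
    """Monte Carlo MSE of the average of n_tickets noisy predictions of y = x^2.

    Assumption (not from the paper): every ticket's error is independent,
    zero-mean Gaussian noise with standard deviation noise_std.
    """
    rng = random.Random(seed)
    total_sq_err = 0.0
    for _ in range(n_trials):
        x = rng.uniform(-1.0, 1.0)
        target = x * x
        # Each hypothetical ticket predicts the target plus its own noise.
        preds = [target + rng.gauss(0.0, noise_std) for _ in range(n_tickets)]
        ensemble = sum(preds) / n_tickets
        total_sq_err += (ensemble - target) ** 2
    return total_sq_err / n_trials


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        mse = ensemble_mse(n)
        # If MSE scales as 1/n, the product n * MSE stays near noise_std**2.
        print(f"n={n:3d}  MSE={mse:.2e}  n*MSE={n * mse:.2e}")
```

Under these assumptions the product `n * MSE` should stay close to `noise_std ** 2` as `n` grows, which is the central-limit-type $1/n$ decay. Relating this to the $N^{-1}$ law additionally assumes that the number of tickets grows roughly in proportion to the parameter count $N$. Correlated or biased tickets would break the independence assumption and change the exponent.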
