A Neural Scaling Law from Lottery Ticket Ensembling

Ziming Liu; Max Tegmark

TL;DR

The paper investigates why model performance scales with size in ways that classical neural scaling law (NSL) theories do not explain, and introduces lottery ticket ensembling as a new mechanism. Using a minimal two-layer network trained to fit $y=x^2$, it provides empirical evidence that wider networks contain more lottery tickets, whose ensembling yields a central-limit-type $N^{-1}$ scaling of the loss. This is supported by observations of symmetric neurons and by a formal variance-reduction argument. The key contributions are the identification of lottery tickets in wide networks, a central-limit-theorem-style justification for $N^{-1}$ scaling, and an analysis of ensembling versus synergy, with implications for large models and physics-inspired learning theories. The work links ticket-level structure to macroscopic performance and suggests avenues for pruning, distillation, and theory development in deep learning.

Abstract

Neural scaling laws (NSL) refer to the phenomenon where model performance improves with scale. Sharma & Kaplan analyzed NSL using approximation theory and predicted that MSE losses decay as $N^{-\alpha}$ with $\alpha=4/d$, where $N$ is the number of model parameters and $d$ is the intrinsic input dimension. Although their theory works well for some cases (e.g., ReLU networks), we surprisingly find that a simple 1D problem, $y=x^2$, manifests a different scaling law ($\alpha=1$) from their prediction ($\alpha=4$). We opened the neural networks and found that the new scaling law originates from lottery ticket ensembling: a wider network on average has more "lottery tickets", which are ensembled to reduce the variance of outputs. We support the ensembling mechanism by mechanistically interpreting single neural networks, as well as studying them statistically. We attribute the $N^{-1}$ scaling law to the "central limit theorem" of lottery tickets. Finally, we discuss its potential implications for large language models and statistical physics-type theories of learning.
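The variance-reduction idea behind the $N^{-1}$ law can be illustrated with a toy simulation. The sketch below is not the paper's experiment: it assumes each ensemble member contributes an independent, zero-mean Gaussian prediction error of equal variance, and the function name `ensemble_mse` and its parameters are hypothetical. Under those assumptions, the mean squared error of the averaged prediction should fall roughly as one over the number of members.

```python
import numpy as np

def ensemble_mse(n_members, n_trials=20000, noise_std=1.0, seed=0):
    """MSE of the average of n_members independent, zero-mean-error predictors.

    Assumption (not from the paper): errors are i.i.d. Gaussian.
    """
    rng = np.random.default_rng(seed)
    # Each row is one ensemble; each entry is one member's prediction error.
    errors = rng.normal(0.0, noise_std, size=(n_trials, n_members))
    # Ensembling here means plain averaging of the members' outputs.
    ensemble_error = errors.mean(axis=1)
    return float(np.mean(ensemble_error ** 2))

sizes = [1, 2, 4, 8, 16, 32, 64]
losses = [ensemble_mse(n) for n in sizes]

# Slope of log(loss) against log(n); the central-limit argument predicts about -1.
slope = np.polyfit(np.log(sizes), np.log(losses), 1)[0]
print(f"fitted scaling exponent: {slope:.2f}")
```

If the members' errors were correlated, the averaging would reduce variance more slowly, so the independence assumption is what produces the clean exponent in this toy setting.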

Paper Structure (11 sections, 11 equations, 13 figures)

Introduction
A New Scaling Law Not Explained by Approximation Theory
The Old Tale: Approximating Functions On Data Manifold
Experiments: Discovery of A New Scaling Law
Mechanistically Understanding Lottery Tickets
A Central Limit Theorem of Lottery Tickets
Related works and Discussions
More examples
Algorithm discovery by parameter space clustering
Disentangling lottery tickets and their correlations
Distilling a narrow network from a wide network

Figures (13)

Figure 1: (a) The $\ell\propto N^{-4/d}$ scaling law from Sharma and Kaplan can be understood as approximating a $d$-dimensional function when data points lie uniformly inside a hypercube. (b) Our simple setup is training a two-layer SiLU or (Leaky)ReLU network (one hidden layer with width $N$) to fit the squared function $y=x^2$. (c) A surprising $N^{-1}$ scaling emerges for SiLU networks and at the tail of ReLU networks, while Sharma and Kaplan's prediction $N^{-4}$ only appears at the early stage of ReLU.
Figure 2: Evidence of lottery tickets. (a) For an extremely wide network with $N=10000$, the distribution of weights and biases in the first layer displays an intriguing symmetry, i.e., there exist symmetric neurons $(w,b)$ and $(-w,b)$. (b) We train a thousand $N=2$ networks independently and show the histogram of their losses. The histogram displays a few peaks, suggesting the existence of a few different local minima or "algorithms". We call the peak with the lowest loss "lottery tickets". (c) We find lottery tickets to have symmetric neurons, which guarantee that the network represents an even function.
Figure 3: "Central limit theorem" of lottery tickets. For each width $N$, we train 1000 networks independently and plot their loss histograms. For small $N$, the distribution is multi-modal, i.e., shows more than one peak; for large $N$, the distribution becomes more single-peaked.
Figure 4: Do the benefits of large widths come from plain ensembling or more complicated synergy of smaller subparts? In each plot "$N$" means a network with width $N$, "$N/2$"+"$N/2$" means two networks with width $N/2$ are ensembled. If these two loss histograms are different (e.g., $N=2,4$), this means more complicated synergy is in place beyond ensembling. If the two loss histograms are similar (e.g., $N=20,40$), this means the role of synergy is vanishing, and the benefit of larger widths solely comes from ensembling.
Figure 5: NN loss histograms for unary functions.
...and 8 more figures
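
The claim in Figure 2(c), that a pair of symmetric neurons $(w,b)$ and $(-w,b)$ yields an even function, can be checked numerically. The sketch below is an illustration rather than the authors' code: it assumes a SiLU activation and, as an extra assumption not stated in the caption, that both neurons share the same output weight `a`. The helper names `silu` and `symmetric_pair` and the parameter values are hypothetical.

```python
import numpy as np

def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def symmetric_pair(x, w, b, a):
    """Output of two hidden neurons (w, b) and (-w, b) with a shared output weight a."""
    return a * silu(w * x + b) + a * silu(-w * x + b)

x = np.linspace(-2.0, 2.0, 401)
params = dict(w=1.3, b=0.5, a=0.7)  # arbitrary illustrative values

# Evenness check: f(x) should equal f(-x) for the symmetric pair.
pair_asymmetry = np.max(np.abs(symmetric_pair(x, **params) - symmetric_pair(-x, **params)))

# Contrast: a single neuron on its own is generally not even.
single = lambda t: params["a"] * silu(params["w"] * t + params["b"])
single_asymmetry = np.max(np.abs(single(x) - single(-x)))

print(f"pair asymmetry:   {pair_asymmetry:.2e}")
print(f"single asymmetry: {single_asymmetry:.2e}")
```

Swapping $x$ for $-x$ simply exchanges the two neurons' pre-activations, so the sum is unchanged. This makes the pair a natural building block for an even target such as $y=x^2$.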
