A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws

Hong-Yi Wang, Di Luo, Tomaso Poggio, Isaac L. Chuang, Liu Ziyin

TL;DR

The authors prove a universal compression theorem: a generic permutation-invariant function of $d$ objects can be represented by $O(\mathrm{polylog}(d))$ objects with asymptotically vanishing error, using moment matching and a deep-set representation. This yields a proof of the dynamical lottery ticket hypothesis, namely that large neural networks can be compressed to polylogarithmic width without altering their training dynamics, and it implies that neural scaling laws can be substantially improved by compressing both data and parameters. The theory combines results on multivariate symmetric polynomials (the fundamental theorem of symmetric polynomials, FTSP, and Tchakaloff's theorem) with practical algorithms, and numerical experiments show that compression preserves learning dynamics and significantly accelerates scaling. The work points to practical compression strategies, potential gains in training speed and data efficiency, and extensions of the framework to broader symmetry groups and scalable implementations.

Abstract

When training large-scale models, the performance typically scales with the number of parameters and the dataset size according to a slow power law. A fundamental theoretical and practical question is whether comparable performance can be achieved with significantly smaller models and substantially less data. In this work, we provide a positive and constructive answer. We prove that a generic permutation-invariant function of $d$ objects can be asymptotically compressed into a function of $\operatorname{polylog} d$ objects with vanishing error. This theorem yields two key implications: (Ia) a large neural network can be compressed to polylogarithmic width while preserving its learning dynamics; (Ib) a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged. (Ia) directly establishes a proof of the dynamical lottery ticket hypothesis, which states that any ordinary network can be strongly compressed such that the learning dynamics and result remain unchanged. (Ib) shows that a neural scaling law of the form $L \sim d^{-\alpha}$ can be boosted to an arbitrarily fast power law decay, and ultimately to $\exp(-\alpha' \sqrt[m]{d})$.
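
The step from a power law to the stretched-exponential form in (Ib) can be made explicit with a short substitution. This is only a sketch: it assumes the compressed size has the specific form $d' = c(\log d)^m$, where the constant $c > 0$ and the integer $m \ge 1$ are illustrative symbols not taken from the paper, and it assumes compression leaves the loss unchanged.

$$
d' = c(\log d)^m \;\Longrightarrow\; d = \exp\!\big((d'/c)^{1/m}\big), \qquad
L \sim d^{-\alpha} = \exp\!\big(-\alpha\,(d'/c)^{1/m}\big) = \exp\!\big(-\alpha' \sqrt[m]{d'}\big), \quad \alpha' = \alpha\, c^{-1/m}.
$$

Read as a function of the number of objects actually used, $d'$, this has the same shape as the $\exp(-\alpha' \sqrt[m]{d})$ decay quoted in the abstract, where $d$ presumably denotes the size after compression.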

Paper Structure

This paper contains 28 sections, 11 theorems, 64 equations, 5 figures, 2 algorithms.

Key Result

Theorem 1

Let $\theta = (w_1, \dots, w_d)$ with $w_i \in \mathbb{R}^m$, and let $f(\theta)$ be a polynomial in all scalar components $w_{i,a}$. Then any symmetric function $f(\theta)$ can be expressed as a function of the moments $p_k$, $k \in [d]$, defined by

Figures (5)

  • Figure 1: (a) Illustration of the main idea behind the compressibility of neural networks and datasets. (1) Permutation symmetry allows a high-dimensional function to be decomposed into a composition of $d$ low-dimensional "objects" (dots in the figure). (2) When $d$ is large, these objects become crowded, and those lying in denser regions are essentially redundant; they can be compressed into $d' = O(\operatorname{polylog} d)$ objects. The potential curse of dimensionality can thus be mitigated, or even removed, when the underlying function is smooth, a lesson well known in nonparametric statistics. (b) Decomposing the linear weights of a neural network into "objects" of symmetric status.
  • Figure 2: Error scaling for compressing a general symmetric function (Eq. \ref{eq:sigmoid_func}) using the moment-matching method. (a–d): each point shows the error in $f$ after compressing $d$ input objects to $\max([0.1d], N_{m,k})$. Matching higher-order moments leads to faster error decay. (e): $\alpha$ is the fitted exponent in $|f(\theta) - f(\theta')| \propto d^{-\alpha}$. The dashed lines indicate $(k+1)/m+0.5$, in good agreement with the numerical results.
  • Figure 3: Compression of the training dataset in a teacher-student setup. Blue dashed line: training with the original dataset of size $d=10^4$; Orange line: training with a compressed dataset of size $10^3$, using order-$5$ moment matching; Green line: training with a size-$10^3$ subset of the original dataset. Each run uses a cosine annealing learning-rate scheduler, annealing from the value shown in the plot titles to $0$. Test MSE loss values are plotted every $10$ epochs. Training on the compressed dataset closely tracks training on the original dataset, whereas training on a naively subsampled dataset does not.
  • Figure 4: Dynamical LTH (Theorem 5). The demonstrated task is learning a bivariate function from noisy training data. (a) Ground-truth function $f(x_1, x_2) = J_6(20r)\cos(6\theta)$, known as a cylindrical harmonic, where $(r,\theta)$ are the polar coordinates of $(x_1, x_2)$. (b–d) MSE loss vs. epoch under three different update rules. Blue dashed line: randomly initialized network of width $10^4$; Orange line: compressed network of width $10^3$, using $k=5$ moment matching; Green line: random subnetwork of the $10^4$-width network, also of width $10^3$. Loss values are plotted every $50$ epochs. All runs employ a cosine annealing learning-rate scheduler. Batch size is $512$ in all cases, and the three curves in each panel share identical sequences of mini-batches.
  • Figure 5: Improving neural scaling laws through compression. (a) MSE loss of the teacher-student task after training on an original dataset of size $d$ vs. a compressed dataset of size $d'$. (b) MSE loss of the cylindrical harmonic task after training a two-layer neural network of width $d$ vs. its compressed counterpart of width $d'$. In both panels, we compress $d$ objects to $d' = [16\sqrt{d}]$ using $k=6$ moment matching. The exponent $\alpha$ is obtained by fitting $L \propto d^{-\alpha}$ or $L \propto d'^{-\alpha}$.

Theorems & Definitions (23)

  • Definition 1
  • Theorem 1: FTSP, Multivariate Variant
  • Theorem 2: Tchakaloff's theorem (Tchakaloff, 1957)
  • Definition 2
  • Theorem 3: Moment matching in a small ball
  • Theorem 4: Universal Compression
  • Theorem 5: Dynamical lottery ticket hypothesis
  • Theorem 6: Fundamental theorem of symmetric polynomials
  • proof
  • proof
  • ...and 13 more
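
The fundamental theorem of symmetric polynomials listed above (Theorem 1 and Theorem 6) says that a symmetric polynomial can be rewritten in terms of moments. The sketch below checks this numerically in the simplest setting, under assumptions that are not taken from the paper: scalar objects ($m = 1$), unnormalized power sums $p_k = \sum_i w_i^k$, and Newton's identities for the elementary symmetric polynomials $e_2$ and $e_3$. The helper names are hypothetical and chosen for illustration only.

```python
import itertools
import numpy as np


def power_sums(w, k_max):
    """Return [p_1, ..., p_k_max] with p_k = sum_i w_i**k.

    Assumption: scalar objects and unnormalized moments; the paper's
    multivariate definition and normalization may differ.
    """
    w = np.asarray(w, dtype=float)
    return [float(np.sum(w ** k)) for k in range(1, k_max + 1)]


def e2_direct(w):
    """Elementary symmetric polynomial e_2: sum of products over all pairs."""
    return float(sum(a * b for a, b in itertools.combinations(w, 2)))


def e3_direct(w):
    """Elementary symmetric polynomial e_3: sum of products over all triples."""
    return float(sum(a * b * c for a, b, c in itertools.combinations(w, 3)))


def e2_from_moments(p1, p2):
    """Newton's identity: e_2 = (p_1^2 - p_2) / 2."""
    return 0.5 * (p1 ** 2 - p2)


def e3_from_moments(p1, p2, p3):
    """Newton's identity: e_3 = (p_1^3 - 3 p_1 p_2 + 2 p_3) / 6."""
    return (p1 ** 3 - 3.0 * p1 * p2 + 2.0 * p3) / 6.0


# Illustrative check on random scalar "objects".
rng = np.random.default_rng(0)
w = rng.normal(size=12)
p1, p2, p3 = power_sums(w, 3)

# The symmetric polynomials are recovered from the first three moments alone.
assert np.isclose(e2_direct(w), e2_from_moments(p1, p2))
assert np.isclose(e3_direct(w), e3_from_moments(p1, p2, p3))

# Permuting the objects leaves the moments, and hence e_2 and e_3, unchanged.
w_perm = rng.permutation(w)
assert np.allclose(power_sums(w_perm, 3), [p1, p2, p3])
```

Because $e_2$ and $e_3$ depend on the objects only through $p_1, p_2, p_3$, any other set of objects with the same three power sums gives identical values. That is the intuition behind compression by moment matching, although the paper's actual construction and error bounds are more general than this toy check.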