A universal compression theory: Lottery ticket hypothesis and superpolynomial scaling laws
Hong-Yi Wang, Di Luo, Tomaso Poggio, Isaac L. Chuang, Liu Ziyin
TL;DR
The authors prove a universal compression theorem: a generic permutation-invariant function of $d$ objects can be asymptotically represented by a function of $O(\mathrm{polylog}(d))$ objects with vanishing error, using moment matching and a deep-set representation. The theorem yields a dynamical lottery ticket hypothesis, under which a large neural network can be compressed to polylogarithmic width without altering its training dynamics, and it implies that neural scaling laws can be sharply accelerated by compressing both data and parameters. The theory combines ideas from multivariate symmetric polynomials (FTSP/Tchakaloff) with practical algorithms, and numerical experiments show compression that preserves learning dynamics and speeds up scaling. The work points to practical compression strategies, potential gains in training speed and data efficiency, and extensions to broader symmetry groups and scalable implementations.
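To make the moment-matching idea concrete, here is a minimal one-dimensional sketch. It is not the authors' algorithm: it assumes scalar objects, swaps in classical Gauss quadrature (computed with a Lanczos recursion), and returns weighted points rather than a plain subset. The names `compress_points`, `nodes`, and `weights` are hypothetical. The sketch replaces $d$ points with $n$ weighted nodes that share the first $2n$ moments, so a smooth permutation-invariant average over the points is nearly unchanged.

```python
import numpy as np

def compress_points(x, n):
    """Return n weighted nodes matching moments 0..2n-1 of the uniform measure on x.

    Illustrative assumption: 1-D points and Gauss quadrature via Lanczos,
    not the multivariate construction used in the paper.
    """
    x = np.asarray(x, dtype=float)
    q = np.full(x.size, 1.0 / np.sqrt(x.size))  # sqrt of uniform weights, unit norm
    basis, alpha, beta = [q], [], []
    for k in range(n):
        v = x * basis[k]                  # apply the operator diag(x)
        alpha.append(basis[k] @ v)        # diagonal entry of the Jacobi matrix
        for b in basis:                   # full reorthogonalization for stability
            v = v - (b @ v) * b
        if k < n - 1:
            norm = np.linalg.norm(v)
            beta.append(norm)             # off-diagonal entry of the Jacobi matrix
            basis.append(v / norm)
    jacobi = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    nodes, vecs = np.linalg.eigh(jacobi)
    weights = vecs[0] ** 2                # squared first components of eigenvectors
    return nodes, weights

# Compress 10,000 points to 6 weighted nodes and compare a symmetric function.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10_000)
nodes, weights = compress_points(x, n=6)
full_value = np.mean(np.exp(x))           # permutation-invariant function of all points
small_value = weights @ np.exp(nodes)     # same function evaluated on the compressed set
print(abs(full_value - small_value))
```

The Lanczos route is chosen here only because it is numerically stabler than solving the moment equations directly; any method that matches the same moments would serve the illustration equally well.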
Abstract
When training large-scale models, the performance typically scales with the number of parameters and the dataset size according to a slow power law. A fundamental theoretical and practical question is whether comparable performance can be achieved with significantly smaller models and substantially less data. In this work, we provide a positive and constructive answer. We prove that a generic permutation-invariant function of $d$ objects can be asymptotically compressed into a function of $\operatorname{polylog} d$ objects with vanishing error. This theorem yields two key implications: (Ia) a large neural network can be compressed to polylogarithmic width while preserving its learning dynamics; (Ib) a large dataset can be compressed to polylogarithmic size while leaving the loss landscape of the corresponding model unchanged. (Ia) directly establishes a proof of the dynamical lottery ticket hypothesis, which states that any ordinary network can be strongly compressed such that the learning dynamics and result remain unchanged. (Ib) shows that a neural scaling law of the form $L\sim d^{-\alpha}$ can be boosted to an arbitrarily fast power law decay, and ultimately to $\exp(-\alpha' \sqrt[m]{d})$.
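The step from polylogarithmic compression to the stretched-exponential law can be sketched in a few lines. This is a heuristic under stated assumptions, not the paper's proof: suppose a dataset of size $d$ compresses to size $d' = c\,(\log d)^{m}$ with the loss unchanged, where $c > 0$ and $d'$ are symbols introduced here for illustration and $m$ is identified with the exponent of the polylogarithm. Then

$$
\log d = \left(\frac{d'}{c}\right)^{1/m},
\qquad
L \sim d^{-\alpha} = \exp(-\alpha \log d) = \exp\!\left(-\alpha' \sqrt[m]{d'}\right),
\qquad
\alpha' = \alpha\, c^{-1/m}.
$$

Written in terms of the compressed dataset size, the original power law therefore becomes a stretched exponential, which decays faster than any power of $d'$.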
