How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers
Gon Buzaglo, Itamar Harel, Mor Shpigel Nacson, Alon Brutzkus, Nathan Srebro, Daniel Soudry
TL;DR
This work investigates why over-parameterized neural networks (NNs) generalize even when they perfectly fit the training data. It shows that a random NN sampled from a uniform prior over the weights, conditioned on interpolating the data, generalizes well whenever a narrow, under-parameterized teacher exists: redundancy in the NN parameterization turns the uniform weight prior into a non-uniform induced prior over functions that favors simpler ones. The authors derive generalization bounds in terms of an effective sample complexity $\tilde{C}=-\log(\tilde{p})$, where $\tilde{p}$ is the probability that a random NN matches the teacher; they obtain explicit bounds for quantized architectures (Vanilla FCN, Scaled-Neuron FCN, CNNs and SCNNs) and extend them to continuous networks via angular-margin assumptions. The results imply that learning with a random interpolator can be efficient when a sufficiently narrow teacher exists, with sample complexity scaling roughly with the teacher's complexity and the quantization level rather than with the student's width. This connects MDL/Occam ideas with neural-network redundancy and offers a route to understanding implicit biases that arise from parameterization, with potential impact on model compression and architectural design.
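As a toy illustration of how a uniform weight prior can induce a non-uniform function prior, the sketch below enumerates every parameter setting of a very small quantized network and counts how many settings realize each input-output function. Everything specific here is an assumption made for illustration and is not taken from the paper: the ternary weight set, the two-neuron ReLU architecture without biases, the convention that a zero pre-activation maps to the label -1, and the reading of "matches the teacher" as agreement on all inputs. The helper names (`truth_table`, `induced_function_prior`, `effective_complexity`) are hypothetical.

```python
import itertools
import math
from collections import Counter

WEIGHT_VALUES = (-1, 0, 1)                            # assumed ternary quantization
INPUTS = list(itertools.product((-1, 1), repeat=2))   # all 4 inputs in {-1, 1}^2
HIDDEN = 2                                            # student width (illustrative)


def relu(z):
    return z if z > 0 else 0


def predict(params, x):
    # Layout: the first 2*HIDDEN entries are hidden weights (2 per neuron),
    # the last HIDDEN entries are output weights.
    w = [params[2 * j: 2 * j + 2] for j in range(HIDDEN)]
    a = params[2 * HIDDEN:]
    pre = sum(a[j] * relu(w[j][0] * x[0] + w[j][1] * x[1]) for j in range(HIDDEN))
    return 1 if pre > 0 else -1   # assumption: a zero output maps to -1


def truth_table(params):
    # The function a parameter setting computes, as its labels on all inputs.
    return tuple(predict(params, x) for x in INPUTS)


def induced_function_prior():
    # Uniform prior over all 3^(3*HIDDEN) weight settings, pushed forward to functions.
    n_params = 3 * HIDDEN
    counts = Counter(
        truth_table(p) for p in itertools.product(WEIGHT_VALUES, repeat=n_params)
    )
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()}


def effective_complexity(prior, teacher_table):
    # C~ = -log(p~), with p~ the prior mass of the teacher's function.
    return -math.log(prior[teacher_table])


prior = induced_function_prior()
# A one-neuron teacher (label +1 iff x0 = 1), embedded in the width-2 student.
teacher_table = truth_table((1, 0, 0, 0, 1, 0))
for table, mass in sorted(prior.items(), key=lambda kv: -kv[1]):
    print(table, round(mass, 4))
print("C~ for the teacher:", effective_complexity(prior, teacher_table))
```

Although every weight setting is equally likely, the printed masses differ across functions, because many distinct settings (for example, any setting with a zero output weight on a neuron) compute the same function.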
Abstract
Background. A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so that they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly uniform prior over the parameters, conditioned on the NN perfectly classifying the training set. Interestingly, such a NN sample typically generalized as well as SGD-trained NNs. Contributions. We prove that such a random NN interpolator typically generalizes well if there exists an underlying narrow "teacher NN" that agrees with the labels. Specifically, we show that such a "flat" prior over the NN parameterization induces a rich prior over the NN functions, due to the redundancy in the NN structure. In particular, this creates a bias towards simpler functions, which require fewer relevant parameters to represent. This enables learning with a sample complexity approximately proportional to the complexity of the teacher (roughly, the number of non-redundant parameters), rather than that of the student.
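The sampling procedure described above (draw parameters uniformly, keep the draw only if it classifies the training set perfectly) can be sketched as a simple rejection loop. This is a minimal toy version under stated assumptions, not the authors' experimental setup: the ternary weight set, the bias-free two-layer ReLU architecture, the input dimension, the widths, the training-set size, and the `max_tries` cap are all illustrative choices, and the function names are hypothetical.

```python
import itertools
import random

WEIGHT_VALUES = (-1, 0, 1)   # assumed ternary quantization
DIM = 3                      # input dimension (illustrative)


def relu(z):
    return z if z > 0 else 0


def predict(hidden, out, x):
    # Two-layer ReLU network without biases; a zero output maps to -1 (assumption).
    pre = sum(
        a * relu(sum(wi * xi for wi, xi in zip(w, x)))
        for w, a in zip(hidden, out)
    )
    return 1 if pre > 0 else -1


def random_network(width, rng):
    # One draw from the uniform prior over quantized weights.
    hidden = [tuple(rng.choice(WEIGHT_VALUES) for _ in range(DIM)) for _ in range(width)]
    out = tuple(rng.choice(WEIGHT_VALUES) for _ in range(width))
    return hidden, out


def sample_interpolator(train, width, rng, max_tries=100_000):
    # Rejection sampling: keep the first draw that fits every training label.
    for _ in range(max_tries):
        hidden, out = random_network(width, rng)
        if all(predict(hidden, out, x) == y for x, y in train):
            return hidden, out
    return None   # no interpolator found within the budget


rng = random.Random(0)
teacher = ([(1, 0, 0)], (1,))                       # narrow teacher: one hidden neuron
all_inputs = list(itertools.product((-1, 1), repeat=DIM))
train = [(x, predict(*teacher, x)) for x in rng.sample(all_inputs, 5)]

student = sample_interpolator(train, width=4, rng=rng)   # wider student
if student is not None:
    errors = sum(predict(*student, x) != predict(*teacher, x) for x in all_inputs)
    print("test error of sampled interpolator:", errors / len(all_inputs))
```

Rejection sampling is chosen here only because it targets the conditional distribution exactly; its cost grows quickly with the number of training points, so it is practical only at toy scale.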
