How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers

Gon Buzaglo, Itamar Harel, Mor Shpigel Nacson, Alon Brutzkus, Nathan Srebro, Daniel Soudry

TL;DR

This work investigates why over-parameterized neural networks generalize even when they perfectly fit training data. It shows that sampling a random NN from a uniform weight prior conditioned on interpolation yields good generalization if a narrow, under-parameterized teacher exists, due to redundancy in NN parameterizations creating a non-uniform, simpler induced function prior. The authors derive generalization bounds based on an effective sample complexity $\tilde{C}=-\log(\tilde{p})$, where $\tilde{p}$ is the probability a random NN matches the teacher; they obtain explicit bounds for quantized architectures (Vanilla FCN, Scaled-Neuron FCN, CNNs and SCNNs) and extensions to continuous nets via angular-margin assumptions. The results imply that learning with a random interpolator can be efficient when there exists a sufficiently narrow teacher, with sample complexity scaling roughly with teacher complexity and the quantization level, rather than the student width. This connects MDL/Occam ideas with neural-network redundancy and provides a pathway to understanding implicit biases arising from parameterization, with potential impact on model compression and architectural design.
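
As a rough illustration of how $\tilde{C}=-\log(\tilde{p})$ can depend on the teacher and the quantization level rather than on the student width, consider the following worked bound. The symbols $Q$ (number of quantization levels) and $M$ (number of student weights that must take specific values) are assumptions introduced here for illustration; this is a simplified counting sketch, not the paper's exact statement.

```latex
% Assumption (illustrative): the student's weights are i.i.d. and uniform over
% Q quantization levels, and fixing M of them to specific values is sufficient
% for the student to compute the same function as the teacher.
\tilde{p} \;\ge\; Q^{-M}
\quad\Longrightarrow\quad
\tilde{C} \;=\; -\log(\tilde{p}) \;\le\; M \log Q .
% Example: Q = 3 and M = 10 give \tilde{C} \le 10 \log 3 \approx 10.99 (natural log).
```

Under this assumption, widening the student changes the bound only through $M$, so weights that never need to be fixed do not add to $\tilde{C}$.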

Abstract

Background. A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so that they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly uniform prior over the parameters, conditioned on the NN perfectly classifying the training set. Interestingly, such a NN sample typically generalized as well as SGD-trained NNs.

Contributions. We prove that such a random NN interpolator typically generalizes well if there exists an underlying narrow "teacher NN" that agrees with the labels. Specifically, we show that such a 'flat' prior over the NN parameterization induces a rich prior over the NN functions, due to the redundancy in the NN structure. In particular, this creates a bias towards simpler functions, which require fewer relevant parameters to represent, enabling learning with a sample complexity approximately proportional to the complexity of the teacher (roughly, the number of non-redundant parameters), rather than the student's.
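
The sampling procedure described in the abstract (draw weights from a uniform prior, keep the draw only if the network fits the training set) can be sketched as a rejection sampler. This is a minimal toy sketch, not the authors' implementation: the two-layer bias-free architecture, the three quantization levels, the width-1 teacher, and the names `forward` and `guess_and_check` are all assumptions made for illustration.

```python
import random

def forward(W, v, x):
    """Two-layer ReLU net without biases; returns a label in {0, 1}."""
    hidden = [max(0.0, sum(w_j * x_j for w_j, x_j in zip(row, x))) for row in W]
    return int(sum(v_i * h_i for v_i, h_i in zip(v, hidden)) > 0)

def guess_and_check(X, y, width, levels, rng, max_tries=100_000):
    """Draw i.i.d. uniform quantized weights until the net fits (X, y)."""
    d0 = len(X[0])
    for tries in range(1, max_tries + 1):
        W = [[rng.choice(levels) for _ in range(d0)] for _ in range(width)]
        v = [rng.choice(levels) for _ in range(width)]
        if all(forward(W, v, x) == label for x, label in zip(X, y)):
            return W, v, tries
    raise RuntimeError("no interpolator found within max_tries")

rng = random.Random(0)
levels = [-1, 0, 1]                      # hypothetical quantization levels
W_star, v_star = [[1, -1, 0]], [1]       # hypothetical narrow (width-1) teacher

# Training labels are produced by the teacher, so an interpolator exists.
X_train = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]
y_train = [forward(W_star, v_star, x) for x in X_train]

# The student is wider (width 2) than the teacher.
W, v, tries = guess_and_check(X_train, y_train, width=2, levels=levels, rng=rng)

# Disagreement with the teacher on fresh inputs estimates the generalization error.
X_test = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(2000)]
test_error = sum(
    forward(W, v, x) != forward(W_star, v_star, x) for x in X_test
) / len(X_test)
print(f"interpolator found after {tries} draws; test disagreement = {test_error:.3f}")
```

Rejection sampling is only practical at this toy scale, since the expected number of draws grows as the probability of interpolating shrinks.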

Paper Structure

This paper contains 32 sections, 37 theorems, 286 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Lemma 3.2

Let $\varepsilon\in\left(0,1\right)$ and $\delta\in\left(0,\frac{1}{5}\right)$, and assume that $\tilde{p}<\frac{1}{2}$. For any $N$ larger than the sample complexity, we have that

Figures (3)

  • Figure 1: Illustration of vanilla and scaled neuron three-layer quantized teacher and student neural networks. Note that the visualization does not show the bias units. The proof of Theorem \ref{lem:int_quantized} relies on counting student networks which are functionally equivalent to the teacher network. Figure \ref{fig: scaled teacher} depicts a narrow teacher. In Figure \ref{fig: sparse student}, we visualize a FC student network that replicates the teacher by zeroing out all outgoing weights of any neuron that does not exist in the teacher. Specifically, the blue edges are weights identical to the teacher, and the orange edges are set to zero. Therefore, the white neurons do not affect the network output. In Figure \ref{fig: scaled student}, we visualize an SFC student network that replicates the teacher by setting the scaling parameter to zero (each zero marked with a red 'x') for any neuron that does not exist in the teacher. In both cases, the gray edges do not affect the function. In this specific example, we can see how the redundancy is higher in SFC than in the vanilla FC network, hinting at better generalization capabilities.
  • Figure 2: A two-dimensional illustration of the first layer angular margin. In \ref{fig:a}, we show how the angle $\alpha$ is defined for a single $\mathbf{w}_{i}^{\star}$. Note that $\alpha$ is defined as the minimal angle when considering all rows of $\mathbf{W}_{\star}^{\left(1\right)}$. In \ref{fig:b}, we illustrate how $\alpha$ margin creates a cone around $\mathbf{w}_{i}^{\star}$ in which any $\mathbf{w}_i$ results in the same activation pattern as $\mathbf{w}_{i}^{\star}$ (i.e., as the teacher) on a training set.
  • Figure 3: The density of the log of the ratio between $\beta$ and $\alpha$, for standard-Gaussian data, and a two-layer neural network with $\rho=0.01$ and $d_0=500$, $d_1=10,000$, $d_1^{\star}=1,000$. We sampled $50,000$ such datapoints and calculated $\alpha, \beta$ as the minimal angles as in \ref{eq: angular margin} and \ref{eq:output angular margin} for a randomly initialized model, for a total of $1,000$ times.
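
The replication argument in the Figure 1 caption can be sketched in code: widen a narrow teacher into a student, copy the teacher's weights, and neutralize the extra neurons either by zeroing their outgoing weights (vanilla FC) or by zeroing their scaling parameter (SFC). This is a hedged sketch under several assumptions that are not taken from the paper: biases are omitted, the SFC scaling parameter is assumed to multiply each hidden neuron's ReLU output, teacher neurons are given scale 1, and the helper names `forward` and `embed` are hypothetical.

```python
import random

def forward(layers, x, gammas=None):
    """Hidden layers use ReLU (optionally scaled per neuron); the last layer is linear."""
    h = x
    for i, W in enumerate(layers):
        h = [sum(w * a for w, a in zip(row, h)) for row in W]
        if i < len(layers) - 1:
            h = [max(0.0, z) for z in h]
            if gammas is not None:
                h = [g * z for g, z in zip(gammas[i], h)]
    return h

def embed(teacher, widths, rng, levels, scaled):
    """Widen the teacher to hidden widths `widths`; return (layers, gammas, n_free)."""
    layers, gammas, n_free = [], [], 0
    for i, W_t in enumerate(teacher):
        rows_t, cols_t = len(W_t), len(W_t[0])
        rows = widths[i] if i < len(widths) else rows_t   # output size is unchanged
        cols = widths[i - 1] if i > 0 else cols_t          # input size is unchanged
        W = []
        for r in range(rows):
            row = []
            for c in range(cols):
                if r < rows_t and c < cols_t:
                    row.append(W_t[r][c])            # blue edge: copied from the teacher
                elif c >= cols_t and not scaled:
                    row.append(0)                    # orange edge: outgoing weight of an extra neuron
                else:
                    row.append(rng.choice(levels))   # gray edge: does not affect the function
                    n_free += 1
            W.append(row)
        layers.append(W)
        if i < len(widths):
            # Scale 1 keeps a teacher neuron; scale 0 switches an extra neuron off.
            gammas.append([1] * rows_t + [0] * (rows - rows_t))
    return layers, (gammas if scaled else None), n_free

rng = random.Random(0)
levels = [-1, 0, 1]
teacher_shapes = [(2, 3), (2, 2), (1, 2)]   # hypothetical narrow three-layer teacher
teacher = [[[rng.choice(levels) for _ in range(c)] for _ in range(r)]
           for r, c in teacher_shapes]

fc, _, free_fc = embed(teacher, [4, 4], rng, levels, scaled=False)
sfc, sfc_gammas, free_sfc = embed(teacher, [4, 4], rng, levels, scaled=True)

# Both students should reproduce the teacher's output on arbitrary inputs.
max_gap = 0.0
for _ in range(100):
    x = [rng.gauss(0, 1) for _ in range(3)]
    target = forward(teacher, x)[0]
    max_gap = max(max_gap,
                  abs(forward(fc, x)[0] - target),
                  abs(forward(sfc, x, sfc_gammas)[0] - target))

print(f"max output gap = {max_gap:.2e}; free weights: FC = {free_fc}, SFC = {free_sfc}")
```

The free-weight counts refer only to this particular construction. They illustrate why the SFC parameterization leaves more weights unconstrained, but they are not the paper's bound and do not account for the scaling parameters themselves being fixed.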

Theorems & Definitions (92)

  • Definition 3.1
  • Lemma 3.2: G&C (i.e. Posterior Sampling) Generalization
  • Corollary 3.3: Volume of Generalizing Interpolators
  • Remark 3.4
  • Definition 4.1: Vanilla FC
  • Definition 4.2: Scaled-neuron FC
  • Theorem 4.3: Main result for fully connected neural networks
  • Remark 4.4
  • Definition 5.2: First layer angular margin
  • Definition 5.3: Second layer angular margin
  • ...and 82 more
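
The cone picture behind the first layer angular margin (Figure 2, Definition 5.2) can be checked numerically. The sketch below assumes, without access to the paper's exact equation, that $\alpha$ is the smallest angle between any teacher row and any hyperplane orthogonal to a training point. Under that assumption, any weight vector within angle $\alpha$ of a teacher row produces the same activation signs on the training set. The helper names and the dimensions are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def angular_margin(W_star, X):
    """Smallest angle between a teacher row and a hyperplane {w : w.x = 0}."""
    return min(
        math.asin(min(1.0, abs(dot(w, x)) / (norm(w) * norm(x))))
        for w in W_star for x in X
    )

def rotate(w, theta, rng):
    """Return a unit vector at angle theta from w, in a random direction."""
    w_hat = [c / norm(w) for c in w]
    u = [rng.gauss(0, 1) for _ in w]
    proj = dot(u, w_hat)
    u = [a - proj * b for a, b in zip(u, w_hat)]   # remove the component along w
    u_hat = [c / norm(u) for c in u]
    return [math.cos(theta) * a + math.sin(theta) * b for a, b in zip(w_hat, u_hat)]

def pattern(W, X):
    """Activation pattern: which neurons are active on which training points."""
    return [[dot(w, x) > 0 for x in X] for w in W]

rng = random.Random(0)
d0, n_points, width = 5, 20, 3                     # hypothetical toy sizes
X = [[rng.gauss(0, 1) for _ in range(d0)] for _ in range(n_points)]
W_star = [[rng.gauss(0, 1) for _ in range(d0)] for _ in range(width)]

alpha = angular_margin(W_star, X)

# Rotate every teacher row by 0.9 * alpha, i.e. stay strictly inside the cone.
W_inside = [rotate(w, 0.9 * alpha, rng) for w in W_star]
same_pattern = pattern(W_inside, X) == pattern(W_star, X)
print(f"alpha = {alpha:.4f} rad; activation pattern preserved: {same_pattern}")
```

The factor 0.9 is an arbitrary choice that keeps the perturbed rows inside the cone with some numerical slack; rotating by more than $\alpha$ may flip at least one activation.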