Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process

Guy Blanc, Neha Gupta, Gregory Valiant, Paul Valiant

TL;DR

This work analyzes models trained by SGD with independent label noise to minimize an ℓ2 loss, and characterizes the zero-training-error fixed points via an implicit regularizer reg(θ) = (1/n)∑i ||∇θ h(x_i, θ)||^2. It shows that a zero-error θ* is stable if reg has zero gradient in all directions that preserve zero error, and unstable (strongly repellent) otherwise, with the dynamics behaving like an Ornstein-Uhlenbeck process in parameter directions. The authors demonstrate the framework in three settings (matrix sensing, 1D two-layer ReLU nets, and two-layer sigmoid nets on a single datapoint), where the implicit regularization drives models toward the ground truth, simple piecewise-linear interpolations, or sparse representations, respectively. The results provide a unified explanation for why noisy SGD tends to yield simpler, more generalizable models and point to broader implications for algorithm design and theoretical understanding of deep learning generalization.

Abstract

We consider networks, trained via stochastic gradient descent to minimize $\ell_2$ loss, with the training labels perturbed by independent noise at each iteration. We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term corresponding to the sum over the data points, of the squared $\ell_2$ norm of the gradient of the model with respect to the parameter vector, evaluated at each data point. This holds for networks of any connectivity, width, depth, and choice of activation function. We interpret this implicit regularization term for three simple settings: matrix sensing, two layer ReLU networks trained on one-dimensional data, and two layer networks with sigmoid activations trained on a single datapoint. For these settings, we show why this new and general implicit regularization effect drives the networks towards "simple" models.
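
To make the regularization term concrete, here is a minimal sketch that evaluates reg(θ) = (1/n)∑i ||∇θ h(x_i, θ)||^2 for a small two-layer ReLU network on one-dimensional inputs. The parameterization h(x) = ∑j c_j·relu(a_j·x + b_j) and the names `relu_net`, `grad_sq_norm`, and `implicit_reg` are assumptions made for illustration. The paper's exact architecture may differ (for example, it may include extra linear and bias units).

```python
def relu_net(x, units):
    """Assumed model: h(x) = sum_j c_j * relu(a_j * x + b_j); units is a list of (a, b, c)."""
    return sum(c * max(a * x + b, 0.0) for a, b, c in units)


def grad_sq_norm(x, units):
    """Squared l2 norm of the gradient of h(x, theta) with respect to all (a_j, b_j, c_j)."""
    total = 0.0
    for a, b, c in units:
        pre = a * x + b
        if pre > 0:                  # only active units have nonzero gradient
            total += (c * x) ** 2    # dh/da_j = c_j * x
            total += c ** 2          # dh/db_j = c_j
            total += pre ** 2        # dh/dc_j = relu(a_j * x + b_j)
    return total


def implicit_reg(xs, units):
    """reg(theta) = (1/n) * sum_i ||grad_theta h(x_i, theta)||^2 over the data points xs."""
    return sum(grad_sq_norm(x, units) for x in xs) / len(xs)
```

As a usage example, the single-unit networks `[(1.0, 0.0, 2.0)]` and `[(2.0, 0.0, 1.0)]` compute the same function, yet on the inputs `[1.0, -1.0]` they give `implicit_reg` values of 4.5 and 3.0 respectively. This shows that reg can vary between parameter vectors that fit the data identically.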

Paper Structure

This paper contains 28 sections, 8 theorems, 67 equations, 7 figures.

Key Result

Theorem 1

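Given training data $(x_1,y_1),\ldots,(x_n,y_n)$, consider a model $h(x,\theta)$ with bounded derivatives up to $3^{\textrm{rd}}$ order, and consider the dynamics of SGD, with independent bounded label noise of constant variance, and $\ell_2$ loss function $\frac{1}{n}\sum_{i=1}^n (h(x_i,\theta)-y_i)^2$. A parameter vector $\theta^*$ with 0 training error is stable under these dynamics if the implicit regularizer $reg(\theta)=\frac{1}{n}\sum_{i=1}^n \|\nabla_\theta h(x_i,\theta)\|_2^2$ has zero gradient at $\theta^*$ when restricted to the manifold of 0 training error.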
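
The following toy simulation, which is not taken from the paper, sketches the kind of behavior this result describes. It assumes the model h(x, θ) = θ1·θ2·x with a single datapoint (x, y) = (1, 1), so the zero-error manifold is θ1·θ2 = 1 and the regularizer reduces to θ1² + θ2², which is smallest on that manifold at θ1 = θ2. The function names, learning rate, and ±0.5 label noise are illustrative choices. In this particular toy, each SGD step multiplies θ1² − θ2² by (1 − g²), where g is the learning rate times twice the residual, so persistent label noise shrinks the imbalance while noiseless SGD nearly preserves it.

```python
import random


def sgd_label_noise(theta, steps, lr=0.1, noise=0.5, seed=0):
    """SGD on (t1*t2 - (1 + eps))^2 for the single datapoint x = 1, y = 1.

    eps is drawn fresh each step as +noise or -noise (bounded, constant variance).
    """
    rng = random.Random(seed)
    t1, t2 = theta
    for _ in range(steps):
        eps = noise * rng.choice((-1.0, 1.0))
        residual = t1 * t2 - (1.0 + eps)
        g = 2.0 * lr * residual
        # The gradient of t1*t2 with respect to (t1, t2) is (t2, t1);
        # both coordinates are updated simultaneously.
        t1, t2 = t1 - g * t2, t2 - g * t1
    return t1, t2


def toy_reg(theta):
    """Implicit regularizer for this toy model: ||grad_theta h||^2 = t1^2 + t2^2."""
    t1, t2 = theta
    return t1 ** 2 + t2 ** 2


noisy = sgd_label_noise((2.0, 0.4), steps=5000)             # with label noise
clean = sgd_label_noise((2.0, 0.4), steps=5000, noise=0.0)  # without label noise
```

Here `clean` settles at an unbalanced zero-error point close to where it first reaches the manifold, while `noisy` ends near the balanced point with a lower value of `toy_reg`. Because the label noise never stops, the noisy iterate keeps jittering around the manifold rather than converging exactly.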

Figures (7)

  • Figure 1: Illustration of the implicit regularization of SGD with label noise in the matrix sensing setting (see li2017algorithmic). Here, we are trying to recover a rank-$r$ $d\times d$ matrix $X^*=U^* U^{*\top}$ from $n=5dr$ linear measurements $(A_1,\langle A_1,X^*\rangle),\ldots,(A_n,\langle A_n,X^*\rangle)$, via SGD both with and without label noise, with $r=5$ and $d=100$, and entries of $A_i$ chosen i.i.d. from the standard Gaussian. Plots depict the test and training error for training with and without i.i.d. $N(0,0.1)$ label noise, initializing $U_0=I_d$. (Similar results hold when $U_0$ is chosen with i.i.d. Gaussian entries.) For both training dynamics, the training error quickly converges to zero. The test error without label noise plateaus with large error, whereas the test error with label noise converges to zero, at a longer timescale, inversely proportional to the square of the learning rate, which is consistent with the theory.
  • Figure 2: Both plots depict 2-layer ReLU networks, randomly initialized and trained on the set of 12 points depicted. The left plot shows the final models resulting from training via SGD, for five random initializations. In all cases, the training error is 0, and the models have converged. The right plot shows the models resulting from training via SGD with independent label noise, for 10 random initializations. Theorem \ref{thm:informal_1} explains this behavior as a consequence of our general characterization of the implicit regularization effect that occurs when training via SGD with label noise, given in Theorem \ref{thm:main}. Interestingly, this implicit regularization does not occur (either in theory or in practice) for ReLU networks with only a single layer of trainable weights.
  • Figure 3: Plots depicting the training loss (red) and length of the curve corresponding to the trained model (blue) as a function of the number of training iterations for a 2-layer ReLU network trained on one-dimensional labeled data. The left plot corresponds to SGD without the addition of label noise, and converges to a trained model with curve length $\approx 5.2$. The right plot depicts the training dynamics of SGD with independent label noise, illustrating that training first finds a model with close to zero training error, and then, at a much longer timescale, moves within the zero training error manifold to a "simpler" model with significantly smaller curve length of $\approx 4.3$. Our analysis of the implicit regularization of these dynamics explains why SGD with label noise favors simpler solutions, as well as why this "simplification" occurs at a longer timescale than the initial loss minimization.
  • Figure 4: The leftmost figure depicts the case where the middle datapoint lies between the intercepts of the ReLU units with opposing convexities. The solid line depicts the original function, and the dotted line depicts the function after the perturbation, which preserves the function values at all datapoints and decreases the regularization expression. The rightmost four plots depict the four possible types of ReLU units that could give rise to the function depicted in the left pane, together with the perturbations that realize the effect depicted in the left pane. For cases 2 and 3, the linear and bias units must also be adjusted to preserve the function values at the datapoints.
  • Figure 5: The plots show a perturbation that preserves the function values at the datapoints while strictly decreasing the regularization term.
  • ...and 2 more figures
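
For the matrix sensing setting of Figure 1, the regularizer from the TL;DR can be written out explicitly. The sketch below assumes the model h(A, U) = ⟨A, U Uᵀ⟩, whose gradient with respect to U is (A + Aᵀ)U. The helper names (`gaussian_matrix`, `predict`, `grad_predict`, `sensing_reg`) are hypothetical, and plain nested lists are used to keep the example dependency-free.

```python
import random


def gaussian_matrix(rows, cols, rng):
    """Matrix with i.i.d. standard Gaussian entries, as a list of rows."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def predict(A, U):
    """h(A, U) = <A, U U^T> = sum_{i,j} A_ij * (U U^T)_ij."""
    d, r = len(U), len(U[0])
    return sum(
        A[i][j] * sum(U[i][k] * U[j][k] for k in range(r))
        for i in range(d)
        for j in range(d)
    )


def grad_predict(A, U):
    """Gradient of h with respect to U, which equals (A + A^T) U."""
    d, r = len(U), len(U[0])
    S = [[A[i][j] + A[j][i] for j in range(d)] for i in range(d)]
    return [[sum(S[i][k] * U[k][b] for k in range(d)) for b in range(r)] for i in range(d)]


def sensing_reg(measurements, U):
    """reg(U) = (1/n) * sum_i ||(A_i + A_i^T) U||_F^2 over the measurement matrices."""
    total = 0.0
    for A in measurements:
        G = grad_predict(A, U)
        total += sum(g * g for row in G for g in row)
    return total / len(measurements)
```

When the entries of each A_i are i.i.d. standard Gaussian, a short calculation suggests that the expectation of this quantity is proportional to the squared Frobenius norm of U. This is one way to read why the regularizer favors small-norm factorizations in this setting.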

Theorems & Definitions (17)

  • Theorem 1: informal
  • Theorem 2
  • Theorem 3
  • Definition 1
  • Definition 2
  • Lemma 4
  • proof
  • Lemma 5
  • proof
  • Lemma 6
  • ...and 7 more