Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process

Guy Blanc; Neha Gupta; Gregory Valiant; Paul Valiant

TL;DR

This work analyzes networks trained by SGD to minimize an ℓ2 loss when independent noise is added to the training labels at each iteration. It characterizes the dynamics near zero-training-error parameters via an implicit regularizer, reg(θ) = (1/n) ∑_i ||∇_θ h(x_i, θ)||^2. A zero-error point θ* is stable if reg has zero gradient along every direction that preserves zero training error, and unstable (strongly repellent) otherwise; near such points the dynamics resemble an Ornstein-Uhlenbeck process. The authors interpret the regularizer in three settings: matrix sensing, two-layer ReLU networks on one-dimensional data, and two-layer sigmoid networks on a single datapoint. In these settings the regularizer drives the models toward the ground truth, simple piecewise-linear interpolations, and sparse representations, respectively. The results offer an explanation for why SGD with label noise tends to find simpler models, and may inform algorithm design and the theory of generalization in deep learning.
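To see why the gradient of the model, rather than the loss, appears in the regularizer, it may help to write out one SGD step on a noisy label. The following is a sketch, not the paper's derivation: the step size $\eta$, the sampled index $i$, and the zero-mean noise $\varepsilon_t$ are notation assumed here, and constant factors depend on how the loss is normalized.

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \big(h(x_i, \theta_t) - y_i - \varepsilon_t\big)^2 = \theta_t - 2\eta \big(h(x_i, \theta_t) - y_i - \varepsilon_t\big) \nabla_\theta h(x_i, \theta_t)$$

At a zero-error point, $h(x_i, \theta_t) = y_i$, so the step reduces to

$$\theta_{t+1} = \theta_t + 2\eta\, \varepsilon_t\, \nabla_\theta h(x_i, \theta_t).$$

Under these assumptions the step has zero mean and its size scales with $\|\nabla_\theta h(x_i, \theta_t)\|_2$, which is the quantity that reg(θ) averages in squared form over the data points.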

Abstract

We consider networks, trained via stochastic gradient descent to minimize $\ell_2$ loss, with the training labels perturbed by independent noise at each iteration. We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error in terms of an implicit regularization term: the sum, over the data points, of the squared $\ell_2$ norm of the gradient of the model with respect to the parameter vector, evaluated at each data point. This holds for networks of any connectivity, width, depth, and choice of activation function. We interpret this implicit regularization term for three simple settings: matrix sensing, two-layer ReLU networks trained on one-dimensional data, and two-layer networks with sigmoid activations trained on a single datapoint. For these settings, we show why this new and general implicit regularization effect drives the networks towards "simple" models.
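As a concrete illustration of the regularization term described above, the following minimal sketch evaluates it for a two-layer ReLU network on one-dimensional inputs, together with one label-noise SGD step. The parameterization h(x, θ) = Σ_j c_j · relu(a_j x + b_j), the function names, the Gaussian noise, and the 1/n normalization (taken from the TL;DR) are assumptions made for this example and are not taken from the authors' code.

```python
import numpy as np

def model(x, theta):
    """Assumed two-layer ReLU net on scalar x: sum_j c_j * relu(a_j * x + b_j)."""
    a, b, c = theta
    return np.sum(c * np.maximum(a * x + b, 0.0))

def model_grad(x, theta):
    """Analytic gradient of the model output w.r.t. (a, b, c), flattened."""
    a, b, c = theta
    pre = a * x + b
    active = (pre > 0).astype(float)   # ReLU derivative (taken as 0 at the kink)
    grad_a = c * active * x
    grad_b = c * active
    grad_c = np.maximum(pre, 0.0)
    return np.concatenate([grad_a, grad_b, grad_c])

def implicit_reg(xs, theta):
    """reg(theta) = (1/n) * sum_i ||grad_theta h(x_i, theta)||^2."""
    return float(np.mean([np.sum(model_grad(x, theta) ** 2) for x in xs]))

def noisy_sgd_step(theta, x, y, lr, noise_std, rng):
    """One SGD step on squared loss with fresh noise added to the label.

    Gaussian noise is an assumption of this sketch.
    """
    a, b, c = theta
    noisy_y = y + noise_std * rng.standard_normal()
    residual = model(x, theta) - noisy_y
    g = 2.0 * residual * model_grad(x, theta)
    k = a.size
    return (a - lr * g[:k], b - lr * g[k:2 * k], c - lr * g[2 * k:])

# Hypothetical parameters and data, chosen only for illustration.
theta = (np.array([1.0, -1.0]), np.array([0.0, 0.0]), np.array([2.0, 3.0]))
xs = np.array([2.0, -1.0])
reg_value = implicit_reg(xs, theta)
```

With the noise switched off (`noise_std=0.0`) and a label equal to the model's output, the step leaves the parameters unchanged; with noise on, the parameters move along the model gradient even at zero training error.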
