
Deep linear networks for regression are implicitly regularized towards flat minima

Pierre Marion, Lénaïc Chizat

TL;DR

This paper shows an implicit regularization towards flat minima: the sharpness of the minimizer found by gradient flow is no more than a constant times the lower bound on the sharpness of any minimizer. The constant depends on the condition number of the data covariance matrix, but not on width or depth.

Abstract

The largest eigenvalue of the Hessian, or sharpness, of neural networks is a key quantity to understand their optimization dynamics. In this paper, we study the sharpness of deep linear networks for univariate regression. Minimizers can have arbitrarily large sharpness, but not an arbitrarily small one. Indeed, we show a lower bound on the sharpness of minimizers, which grows linearly with depth. We then study the properties of the minimizer found by gradient flow, which is the limit of gradient descent with vanishing learning rate. We show an implicit regularization towards flat minima: the sharpness of the minimizer is no more than a constant times the lower bound. The constant depends on the condition number of the data covariance matrix, but not on width or depth. This result is proven both for a small-scale initialization and a residual initialization. Results of independent interest are shown in both cases. For small-scale initialization, we show that the learned weight matrices are approximately rank-one and that their singular vectors align. For residual initialization, convergence of the gradient flow for a Gaussian initialization of the residual network is proven. Numerical experiments illustrate our results and connect them to gradient descent with non-vanishing learning rate.
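
The claim that minimizers can be arbitrarily sharp but not arbitrarily flat can be checked by hand in the simplest possible case. The sketch below is an illustration under stated assumptions, not the paper's setting or code: it takes a width-one, one-dimensional network $x \mapsto w_L \cdots w_1 x$ and assumes the risk $R(w) = \sigma^2 (w_L \cdots w_1 - w^\star)^2$ up to an additive constant, where $\sigma^2$ stands for the empirical second moment of the inputs. The function names and the numerical values are hypothetical. At a minimizer the Hessian is $2\sigma^2 g g^\top$ with $g_i = w^\star / w_i$, so the sharpness is $2\sigma^2 \sum_i (w^\star / w_i)^2$.

```python
import math

def sharpness_at_minimizer(weights, sigma2):
    """Largest Hessian eigenvalue of R(w) = sigma2 * (prod(w) - w_star)^2
    at a point where prod(w) == w_star (assumed toy risk, width one).

    There the Hessian equals 2*sigma2 * g g^T with g_i = prod(w) / w_i.
    It has rank one, so its top eigenvalue is 2*sigma2 * ||g||^2.
    """
    p = math.prod(weights)
    return 2.0 * sigma2 * sum((p / w) ** 2 for w in weights)

def balanced_minimizer(w_star, depth):
    """All layers equal to w_star**(1/depth); by AM-GM this is the flattest minimizer."""
    return [w_star ** (1.0 / depth)] * depth

def rescaled(weights, c):
    """Multiply layer 0 by c and divide layer 1 by c: same product, different sharpness."""
    out = list(weights)
    out[0] *= c
    out[1] /= c
    return out

if __name__ == "__main__":
    w_star, sigma2 = 2.0, 1.5  # hypothetical values
    for depth in (2, 4, 8):
        bal = balanced_minimizer(w_star, depth)
        closed_form = 2.0 * sigma2 * depth * w_star ** (2.0 - 2.0 / depth)
        print(depth, sharpness_at_minimizer(bal, sigma2), closed_form)
    # Unbalancing the layers keeps the network function but inflates the sharpness.
    for c in (1.0, 10.0, 100.0):
        print(c, sharpness_at_minimizer(rescaled(balanced_minimizer(w_star, 4), c), sigma2))
```

In this toy case the flattest minimizer has sharpness $2 \sigma^2 L |w^\star|^{2 - \frac{2}{L}}$, which scales linearly in $L$ up to the factor $|w^\star|^{2 - \frac{2}{L}}$, while rescaling two layers by $c$ and $1/c$ makes the sharpness grow without bound.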

Paper Structure (50 sections, 17 theorems, 336 equations, 7 figures)


Key Result

Theorem 1

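Let $X \in {\mathbb{R}}^{n \times d}$ be a design matrix and $y \in {\mathbb{R}}^n$ a target. Then the minimal sharpness $S_{\min}$ of any linear network $x \mapsto W_L \dots W_1 x$ of depth $L$ that implements the optimal linear regressor $w^\star \in {\mathbb{R}}^d$ satisfies a bound whose displayed inequality is missing from this extract; the caption of Figure 2 gives its order of magnitude, $S_{\min} \simeq 2 \|w^\star\|_2^{2 - \frac{2}{L}} L a$. Here $\Lambda$ is the largest eigenvalue of the empirical covariance matrix $\hat{\Sigma} := \frac{1}{n} X^\top X$, and $a$ is a second data-dependent scalar whose definition is likewise missing.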

Figures (7)

  • Figure 1: Training a deep linear network on a univariate regression task with quadratic loss. The weight matrices are initialized as Gaussian random variables, whose standard deviation is the x-axis of the left and right plots. Experimental details are given in the appendix.
  • Figure 2: Squared distance of the trained network to the empirical risk minimizer, for various learning rates and depths. For each depth, learning succeeds if the learning rate is below a threshold, which corresponds to the theoretical value $\frac{2}{S_{\min}} \simeq (\|w^\star\|_2^{2 - \frac{2}{L}} L a)^{-1}$ of Theorem 1 (dashed vertical line).
  • Figure 3: Probability of divergence of gradient descent for a Gaussian initialization of the weight matrices, depending on the initialization scale and the learning rate.
  • Figure 4: Training a deep linear network on a univariate regression task with quadratic loss. The initialization is a residual initialization, as in the section on residual initialization.
  • Figure 5: Probability of divergence of gradient descent for a residual initialization of the weight matrices, depending on the initialization scale and the learning rate.
  • ...and 2 more figures
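
The learning-rate threshold $\frac{2}{S_{\min}}$ in the caption of Figure 2 matches the classical stability condition for gradient descent on a quadratic. The sketch below is a minimal illustration on a one-dimensional local model $f(w) = \frac{S}{2} w^2$, not a reproduction of the paper's experiments; the function name and the value of `S` are hypothetical. Each step multiplies $w$ by $1 - \eta S$, so the iterates contract exactly when $\eta < \frac{2}{S}$.

```python
def gd_on_quadratic(sharpness, lr, w0=1.0, steps=200):
    """Run gradient descent on f(w) = 0.5 * sharpness * w**2 and return |w| at the end.

    The gradient is sharpness * w, so each step maps w to (1 - lr * sharpness) * w.
    """
    w = w0
    for _ in range(steps):
        w -= lr * sharpness * w
    return abs(w)

if __name__ == "__main__":
    S = 4.0                 # hypothetical sharpness of the local quadratic model
    threshold = 2.0 / S     # stability threshold for the learning rate
    print("below threshold:", gd_on_quadratic(S, 0.95 * threshold))
    print("above threshold:", gd_on_quadratic(S, 1.05 * threshold))
```

Since every minimizer has sharpness at least $S_{\min}$, a learning rate above $\frac{2}{S_{\min}}$ makes all of them unstable in this local sense, which is consistent with the failure region reported in Figure 2.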

Theorems & Definitions (23)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • Theorem 3
  • Corollary 1
  • Corollary 2
  • Theorem 4
  • Lemma 3
  • Corollary 3
  • ...and 13 more