The Implicit Bias of Depth: How Incremental Learning Drives Generalization

Daniel Gissin, Shai Shalev-Shwartz, Amit Daniely

TL;DR

This work addresses why deep networks generalize by proposing that gradient descent implicitly favors simple solutions through incremental learning. The authors formalize this phenomenon with a toy deep-linear model and derive explicit gradient-flow dynamics, revealing a dynamical depth separation: deeper models enable incremental learning under much milder initialization than shallow ones. They extend the theory to larger models, including matrix sensing, quadratic networks, and diagonal/convolutional linear nets, and corroborate with experiments showing persistent incremental learning across tasks. The findings suggest depth-induced dynamical biases toward low-rank and sparse solutions, offering insight into generalization that may extend to nonlinear networks.

Abstract

A leading hypothesis for the surprising generalization of neural networks is that the dynamics of gradient descent bias the model towards simple solutions, by searching through the solution space in an incremental order of complexity. We formally define the notion of incremental learning dynamics and derive the conditions on depth and initialization for which this phenomenon arises in deep linear models. Our main theoretical contribution is a dynamical depth separation result, proving that while shallow models can exhibit incremental learning dynamics, they require the initialization to be exponentially small for these dynamics to present themselves. However, once the model becomes deeper, the dependence becomes polynomial and incremental learning can arise in more natural settings. We complement our theoretical findings by experimenting with deep matrix sensing, quadratic neural networks and with binary classification using diagonal and convolutional linear networks, showing all of these models exhibit incremental learning.
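The depth separation described in the abstract can be illustrated numerically. The sketch below is not the authors' code. It assumes that, for a depth-$N$ toy model, each value $\sigma_i$ of the induced model follows the gradient-flow equation $\dot{\sigma}_i = \sigma_i^{2 - 2/N}(\sigma_i^* - \sigma_i)$; this form, the forward-Euler integration, and the names `simulate` and `time_to_fraction` are assumptions made here for illustration only.

```python
import numpy as np

def simulate(depth, sigma_star, sigma0, dt=1e-3, steps=40000):
    """Forward-Euler integration of the ASSUMED dynamics
    d(sigma_i)/dt = sigma_i^(2 - 2/depth) * (sigma_star_i - sigma_i)."""
    sigma_star = np.asarray(sigma_star, dtype=float)
    sigma = np.full_like(sigma_star, sigma0)   # same small init for every value
    exponent = 2.0 - 2.0 / depth               # 0 for depth 1, 4/3 for depth 3
    traj = np.empty((steps + 1, sigma.size))
    traj[0] = sigma
    for k in range(steps):
        sigma = sigma + dt * sigma**exponent * (sigma_star - sigma)
        traj[k + 1] = sigma
    return traj

def time_to_fraction(traj, sigma_star, dt, frac=0.5):
    """Time at which each value first reaches frac * sigma_star.
    Assumes the trajectory is long enough for every value to get there."""
    sigma_star = np.asarray(sigma_star, dtype=float)
    reached = traj >= frac * sigma_star
    return reached.argmax(axis=0) * dt

# Hypothetical target values, chosen to match the caption of Figure 2.
targets = [12.0, 6.0, 4.0, 3.0]
dt = 1e-3
for depth in (1, 3):
    traj = simulate(depth, targets, sigma0=1e-4, dt=dt)
    print(depth, time_to_fraction(traj, targets, dt))
```

Under these assumptions, the depth-1 values all reach half of their targets at essentially the same time, while the depth-3 values do so one after another, largest target first. Changing `sigma0` shows how strongly the gap between these times depends on the initialization scale at each depth.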

Paper Structure

This paper contains 26 sections, 11 theorems, 98 equations, 7 figures.

Key Result

Theorem 1

Minimizing the toy linear model described in equation eq:model with gradient flow over the depth normalized squared loss equation eq:loss, with Gaussian inputs and weights initialized as in equation eq:init and assuming $\sigma_{i}^{*} > 0$ leads to the following analytical solutions for different v

Figures (7)

  • Figure 1: Incremental learning dynamics in deep models. Each panel shows the evolution of the five largest values of $\sigma$, the parameters of the induced model. All models were trained using gradient descent with a small initialization and learning rate, on a small training set such that there are multiple possible solutions. In all cases, the deep parameterization of the models leads to "incremental learning", where the values are learned at different rates (larger values are learned first), leading to sparse solutions. (a) Depth 4 matrix sensing, $\sigma$ denotes singular values (see section \ref{sec:matrix_sensing}). (b) Quadratic networks, $\sigma$ denotes singular values (see section \ref{sec:quadratic_nets}). (c) Depth 3 diagonal networks, $\sigma$ denotes feature weights (see section \ref{sec:classification}). (d) Depth 3 circular-convolutional networks, $\sigma$ denotes amplitudes in the frequency domain of the feature weights (see appendix \ref{app:convolutional_model}).
  • Figure 2: Incremental learning dynamics in the toy model. Each panel shows the evolution of $\frac{\sigma_{i}(t)}{\sigma_{i}^{*}}$ for $\sigma_{i}^{*} \in \{12,6,4,3\}$ according to the analytical solutions in theorem \ref{thm:analytical_solution}, under different depths and initializations. The first column has all values converging at the same rate. Notice how the deep parameterization with small initialization leads to distinct phases of learning, where values are learned incrementally (bottom-right). The shallow model's much weaker incremental learning, even at small initialization scales (second column), is explained in theorem \ref{thm:depth_separation}.
  • Figure 3: Empirical comparison of the dynamics of the toy model to OMP. The toy model has a depth of $5$ and was initialized with a scale of 1e-4 and a learning rate of 3e-3. We compare the fraction of agreement between the sets of the first $s$ features selected by the two algorithms for every given sparsity level $s$, averaged over 100 experiments (the shaded regions are empirical standard deviations). For example, for sparsity level $3$, we look at the sets of the first $3$ features selected by each algorithm and calculate the fraction of them that appear in both sets.
  • Figure 4: Evolution of the top-$5$ singular values of the deep matrix sensing model, with Gaussian initialization with variance such that the initial singular values are in expectation 1e-4. The model's size and data are in $\mathbb{R}^{50 \times 50}$. The columns correspond to different parameterization depths, while the rows correspond to different dataset sizes. In both cases the problem is under-determined, since the number of examples is smaller than the number of parameters. Since the original matrix is rank-$4$, we can recognize an unsuccessful recovery when all five singular values are nonzero, as seen clearly for both depth-1 plots.
  • Figure 5: Quadratic model's evolution of top-$5$ singular values for a rank-4 labeling function. The rows correspond to whether or not a global bias is introduced to the model. The first two columns are for a large dataset (one optimal solution) and the last two columns are for a small dataset (under-determined problem). When a bias is introduced, it is set to its optimal value at initialization. Note how without the bias, the singular values are learned together and there is over-shooting of the optimal singular value caused by the coupling of the dynamics of the singular values. For the small datasets, we see that the model with no bias reaches a solution with a larger rank. Once a global bias is introduced, the dynamics become more incremental as in the analysis of the variance loss. Note that in this case the solution obtained for the small dataset is the optimal low-rank solution.
  • ...and 2 more figures
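
The agreement measure described in the caption of Figure 3 can be written as a short function. This is a sketch of the metric as the caption states it, not the authors' evaluation code; the name `agreement_fraction` and the two selection orders in the usage lines are made up for illustration.

```python
def agreement_fraction(order_a, order_b, s):
    """Fraction of the first s selected features that appear in both sets.

    order_a, order_b: feature indices in the order each algorithm selected them.
    s: sparsity level, i.e. how many of the earliest selections to compare.
    """
    if not 1 <= s <= min(len(order_a), len(order_b)):
        raise ValueError("s must be between 1 and the shorter selection length")
    first_a = set(order_a[:s])
    first_b = set(order_b[:s])
    return len(first_a & first_b) / s

# Hypothetical selection orders for two algorithms on the same problem.
toy_model_order = [4, 1, 7, 2]
omp_order = [4, 7, 3, 1]
for s in range(1, 5):
    print(s, agreement_fraction(toy_model_order, omp_order, s))
```

At sparsity level 3 in this made-up example, the first three selections share features 4 and 7, so the agreement is 2/3. Averaging this value over repeated experiments for each $s$ gives a curve of the kind the caption describes.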

Theorems & Definitions (20)

  • Theorem 1
  • proof : Proof
  • Definition 1
  • Theorem 2
  • proof : Proof sketch (the full proof is given in appendix \ref{app:incremental_learning})
  • Theorem 3
  • Theorem 4
  • Definition 2
  • Theorem 5
  • Theorem
  • ...and 10 more