Understanding the Double Descent Phenomenon in Deep Learning

Marc Lafon, Alexandre Thomas

TL;DR

This tutorial explains double descent and its mechanisms. It first sets out the classical statistical learning framework, then introduces the inductive biases that appear to play a key role in double descent by selecting, among the many interpolating solutions, a smooth empirical risk minimizer.

Abstract

Combining empirical risk minimization with capacity control is a classical strategy in machine learning for controlling the generalization gap and avoiding overfitting as the capacity of the model class grows. Yet, in modern deep learning practice, very large over-parameterized models (e.g. neural networks) are optimized to fit the training data perfectly and still generalize well. Past the interpolation point, increasing model complexity actually seems to lower the test error. In this tutorial, we explain the concept of double descent and its mechanisms. Section 1 sets out the classical statistical learning framework and introduces the double descent phenomenon. Through a number of examples, section 2 introduces inductive biases that appear to play a key role in double descent by selecting, among the many interpolating solutions, a smooth empirical risk minimizer. Finally, section 3 explores double descent with two linear models and presents other points of view from recent related work.

Paper Structure (19 sections, 12 theorems, 49 equations, 7 figures)

Key Result

proposition 1

For any empirical risk minimizer $h_n^* \in \text{argmin}_{h \in \mathcal{H}} L_n(h)$, the estimation error satisfies
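
A standard bound of this type is given below as an assumption about what the proposition asserts, not as a quotation of it. Here $L$ is taken to be the true risk (definition 1) and $L_n$ the empirical risk (definition 2):

$$L(h_n^*) - \inf_{h \in \mathcal{H}} L(h) \;\le\; 2 \sup_{h \in \mathcal{H}} \left| L_n(h) - L(h) \right|$$

The usual argument adds and subtracts the empirical risk. For any $h \in \mathcal{H}$,

$$L(h_n^*) - L(h) = \big[L(h_n^*) - L_n(h_n^*)\big] + \big[L_n(h_n^*) - L_n(h)\big] + \big[L_n(h) - L(h)\big]$$

The middle bracket is non-positive because $h_n^*$ minimizes $L_n$ over $\mathcal{H}$, and each of the two outer brackets is at most $\sup_{h \in \mathcal{H}} |L_n(h) - L(h)|$. Taking the infimum over $h$ gives the bound.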

Figures (7)

  • Figure 1: The classical risk curve arising from the bias-variance trade-off and the double descent risk curve with the observed modern interpolation regime. Taken from Belkin2019
  • Figure 2: All models are ResNet18s trained on CIFAR-10 with 15% label noise (training labels artificially made incorrect), data augmentation, and Adam for up to 4K epochs. Taken from Nakkiran2019
  • Figure 3: Illustration of the tight exponential tail property for different common loss functions. Both the exponential and logistic loss functions have a tight exponential tail. The hinge loss and 0-1 loss functions are displayed for reference only.
  • Figure 4: Fitting degree-$d$ Legendre polynomials (orange curve) to $n=20$ noisy samples (red dots) drawn from a polynomial of degree 3 (blue curve). Gradient descent is used to minimize the squared error, which leads to the minimum-norm solution (in terms of the norm of the coefficient vector). Taken from blog_double_descent.
  • Figure 5: Risk curves as a function of model capacity.
  • ...and 2 more figures
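
The setup described for Figure 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the figure: the degree-3 target, the noise level, the random seed and the list of degrees are all hypothetical choices, and the minimum-norm least-squares solution is computed in closed form with the pseudoinverse rather than by running gradient descent (for squared error on a linear model, gradient descent started from zero converges to that same minimum-norm solution).

```python
import numpy as np
from numpy.polynomial import legendre


def fit_min_norm_legendre(x, y, degree):
    """Least-squares fit in the Legendre basis P_0..P_degree.

    When degree + 1 > len(x) there are infinitely many interpolating
    coefficient vectors; the pseudoinverse returns the one with the
    smallest Euclidean norm.
    """
    features = legendre.legvander(x, degree)  # shape (n, degree + 1)
    return np.linalg.pinv(features) @ y


def target(x):
    # Hypothetical degree-3 ground truth (the source does not specify it).
    return x**3 - 0.5 * x


rng = np.random.default_rng(0)
n = 20                                        # sample size from Figure 4
x_train = np.sort(rng.uniform(-1.0, 1.0, n))
y_train = target(x_train) + 0.1 * rng.standard_normal(n)  # assumed noise level
x_test = np.linspace(-1.0, 1.0, 200)

results = {}
for degree in (3, 10, 19, 50, 200):           # assumed capacities to compare
    coef = fit_min_norm_legendre(x_train, y_train, degree)
    train_mse = np.mean((legendre.legval(x_train, coef) - y_train) ** 2)
    test_mse = np.mean((legendre.legval(x_test, coef) - target(x_test)) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree:4d}  train MSE={train_mse:.2e}  test MSE={test_mse:.2e}")
```

With $n = 20$ samples, the interpolation threshold sits at degree 19 (20 coefficients). Plotting the test error in `results` against the degree is one way to look for the second descent past that threshold; the exact shape of the curve depends on the assumed target, noise and seed.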

Theorems & Definitions (37)

  • definition 1: True risk
  • remark 1
  • definition 2: Empirical risk
  • definition 3: Bayes risk
  • definition 4: Consistency
  • remark 2
  • proposition 1
  • proof
  • remark 3
  • theorem 1: Vapnik-Chervonenkis inequality
  • ...and 27 more