Understanding the Double Descent Phenomenon in Deep Learning

Marc Lafon; Alexandre Thomas

TL;DR

This tutorial explains double descent and its mechanisms. It sets out the classical statistical learning framework and introduces the inductive biases that appear to play a key role in double descent by selecting a smooth empirical risk minimizer among the many interpolating solutions.

Abstract

Combining empirical risk minimization with capacity control is a classical strategy in machine learning for controlling the generalization gap and avoiding overfitting as the capacity of the model class grows. Yet, in modern deep learning practice, very large over-parameterized models (e.g. neural networks) are optimized to fit the training data perfectly and still achieve strong generalization performance. Past the interpolation point, increasing model complexity appears to lower the test error. In this tutorial, we explain the concept of double descent and its mechanisms. Section 1 sets out the classical statistical learning framework and introduces the double descent phenomenon. Through a number of examples, Section 2 introduces inductive biases that appear to play a key role in double descent by selecting, among the multiple interpolating solutions, a smooth empirical risk minimizer. Finally, Section 3 explores double descent with two linear models and presents other points of view from recent related work.
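
As a rough numerical illustration of the behaviour described in the abstract, the sketch below fits minimum-norm least squares models that use only the first p of D Gaussian features, and records train and test error as p grows past the interpolation point p = n. The data model, the sizes (n = 20, D = 100), the noise level, and all function names are assumptions made for this illustration; they are not taken from the paper.

```python
import numpy as np


def min_norm_fit(X, y):
    # np.linalg.lstsq returns the minimum-norm least squares solution
    # when the system is under-determined (more columns than rows).
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


def risk_curve(n=20, D=100, sigma=0.5, n_test=500, trials=50, seed=0):
    """Average train/test squared error when only the first p features are used."""
    rng = np.random.default_rng(seed)
    ps = np.arange(2, D + 1, 2)  # model sizes, includes p = n = 20
    train_mse = np.zeros(len(ps))
    test_mse = np.zeros(len(ps))
    for _ in range(trials):
        # Assumed ground truth: a unit-norm linear signal over all D features.
        beta = rng.normal(size=D)
        beta /= np.linalg.norm(beta)
        X = rng.normal(size=(n, D))
        X_test = rng.normal(size=(n_test, D))
        y = X @ beta + sigma * rng.normal(size=n)
        y_test = X_test @ beta + sigma * rng.normal(size=n_test)
        for i, p in enumerate(ps):
            w = min_norm_fit(X[:, :p], y)
            train_mse[i] += np.mean((X[:, :p] @ w - y) ** 2)
            test_mse[i] += np.mean((X_test[:, :p] @ w - y_test) ** 2)
    return ps, train_mse / trials, test_mse / trials


ps, train_mse, test_mse = risk_curve()
for p, tr, te in zip(ps, train_mse, test_mse):
    print(f"p={p:3d}  train={tr:10.4f}  test={te:12.4f}")
```

In this setup the test error is expected to rise sharply near p = n and to fall again as p approaches D, while the training error is essentially zero for p ≥ n. The exact numbers depend on the seed and on the assumed data model.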

Paper Structure (19 sections, 12 theorems, 49 equations, 7 figures)

Generalization error: classical view and modern practice
Definitions and results from statistical learning
Classical view
Modern practice
Inductive biases
Explicit inductive biases
Least Norm
Model architecture
Ensembling
Implicit Bias of gradient descent
Gradient descent in under-determined least squares problem
Gradient descent on separable data
The reasons behind double descent
Linear Regression with Gaussian Noise
Random Fourier Features
...and 4 more sections

Key Result

proposition 1

For any empirical risk minimizer $h_n^* \in \operatorname{argmin}_{h \in \mathcal{H}} L_n(h)$, the estimation error satisfies
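
The statement above stops before the inequality. A standard bound of this form in statistical learning theory is given below; it may differ in detail from the paper's exact statement. The notation $L(h)$ for the true risk is an assumption here; only the empirical risk $L_n$ appears in the text above.

```latex
% Standard uniform-deviation bound on the estimation error (assumed notation: L = true risk).
L(h_n^*) - \inf_{h \in \mathcal{H}} L(h)
  \;\le\; 2 \sup_{h \in \mathcal{H}} \left| L_n(h) - L(h) \right|

% Sketch of the argument: for any fixed h in \mathcal{H},
L(h_n^*) - L(h)
  = \underbrace{L(h_n^*) - L_n(h_n^*)}_{\le\, \sup_{g \in \mathcal{H}} |L_n(g) - L(g)|}
  + \underbrace{L_n(h_n^*) - L_n(h)}_{\le\, 0 \text{ since } h_n^* \text{ minimizes } L_n}
  + \underbrace{L_n(h) - L(h)}_{\le\, \sup_{g \in \mathcal{H}} |L_n(g) - L(g)|}
% and taking the infimum over h gives the bound.
```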

Figures (7)

Figure 1: The classical risk curve arising from the bias-variance trade-off and the double descent risk curve with the observed modern interpolation regime. Taken from Belkin2019
Figure 2: All models are ResNet18s trained on CIFAR-10 with 15% label noise (training labels artificially made incorrect), data augmentation, and Adam for up to 4K epochs. Taken from Nakkiran2019
Figure 3: Illustration of the tight exponential tail property for different common loss functions. Both the exponential and logistic loss functions have a tight exponential tail. The hinge loss and 0-1 loss functions are displayed for reference only.
Figure 4: Fitting degree $d$ Legendre polynomials (orange curve) to $n=20$ noisy samples (red dots) of a degree-3 polynomial (blue curve). Gradient descent is used to minimize the squared error, which leads to the smallest-norm solution (considering the norm of the vector of coefficients). Taken from blog_double_descent.
Figure 5: Risk curves as a function of model capacity.
...and 2 more figures
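
The Figure 4 caption states that gradient descent on the squared error leads to the smallest-norm solution. The sketch below checks this numerically on a generic under-determined least squares problem. It uses a Gaussian design matrix rather than the Legendre basis of Figure 4, so that the problem is well conditioned and gradient descent converges quickly. The sizes, step size rule, and variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # fewer samples than parameters
X = rng.normal(size=(n, p))         # assumed Gaussian design (not Legendre features)
y = rng.normal(size=n)


def gradient_descent(X, y, steps=2000):
    """Minimize 0.5 * ||X w - y||^2 starting from w = 0."""
    # Step size 1 / sigma_max(X)^2, the inverse Lipschitz constant of the gradient.
    lr = 1.0 / np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w


w_gd = gradient_descent(X, y)

# Minimum-norm interpolating solution via the pseudo-inverse.
w_min = np.linalg.pinv(X) @ y

# Another interpolating solution: add a component from the null space of X.
v = rng.normal(size=p)
null_part = v - np.linalg.pinv(X) @ (X @ v)
w_other = w_min + null_part

print("distance GD vs min-norm:", np.linalg.norm(w_gd - w_min))
print("norm of min-norm solution:", np.linalg.norm(w_min))
print("norm of other interpolant:", np.linalg.norm(w_other))
```

Starting from zero, every gradient step is a combination of the rows of X, so the iterates never leave the row space of X; the only interpolating solution in that subspace is the minimum-norm one. A different initialization would generally converge to a different interpolant.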

Theorems & Definitions (37)

definition 1: True risk
remark 1
definition 2: Empirical risk
definition 3: Bayes risk
definition 4: Consistency
remark 2
proposition 1
proof
remark 3
theorem 1: Vapnik-Chervonenkis inequality
...and 27 more
