Bayesian Double Descent

Nick Polson, Vadim Sokolov

TL;DR

This work provides a coherent Bayesian account of the double descent phenomenon, showing that the re-descending risk with increasing model complexity can be explained by the prior on parameters given the model and the marginal likelihood over models. It unifies Occam's razor with over-parameterization through Bayesian model selection, the Dickey-Savage density ratio, and global-local shrinkage priors, and demonstrates how these ideas extend to polynomial regression and neural network regression. The paper develops a principled framework linking evidence, priors, and shrinkage (e.g., generalized ridge regression) to explain when and why over-parameterized models can generalize well, and it discusses practical applications and computational considerations for Bayesian interpolation and NN regression. The findings have implications for model selection, hyperparameter regularization, and the interpretation of generalization in modern, high-capacity learning systems, with avenues for future research in adaptive priors and scalable inference in deep models.

Abstract

Double descent is a phenomenon of over-parameterized statistical models, such as deep neural networks, whose risk function has a re-descending property. As model complexity increases, the risk first traces a U-shaped curve due to the traditional bias-variance trade-off; then, when the number of parameters equals the number of observations, the model becomes an interpolator and the risk can be unbounded; finally, in the over-parameterized region, the risk re-descends, which is the double descent effect. Our goal is to show that this has a natural Bayesian interpretation. We also show that this is not in conflict with the traditional Occam's razor: simpler models are preferred to complex ones, all else being equal. Our theoretical foundations use Bayesian model selection and the Dickey-Savage density ratio, and connect generalized ridge regression and global-local shrinkage methods with double descent. We illustrate our approach for high-dimensional neural networks and provide detailed treatments of infinite Gaussian means models and non-parametric regression. Finally, we conclude with directions for future research.
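The three regimes described in the abstract (under-parameterized, interpolation threshold, over-parameterized) can be explored numerically. The following is a minimal sketch, not the paper's experiment: the true function `sin(pi x)`, the Legendre basis, the equispaced design, and the helper names `min_norm_fit` and `risk_curve` are all assumptions made for illustration. It fits polynomials of increasing degree by least squares, using the minimum-norm solution once the number of coefficients exceeds the number of observations, and reports training and test error.

```python
import numpy as np
from numpy.polynomial.legendre import legvander


def min_norm_fit(Phi, y):
    """Least-squares coefficients; the minimum-norm interpolant when columns > rows."""
    return np.linalg.pinv(Phi) @ y


def risk_curve(degrees, n_train=20, n_test=200, noise_sd=0.3, seed=0):
    """Train/test MSE of polynomial fits of increasing degree (illustrative setup)."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(np.pi * x)          # assumed true function, not from the paper
    x_tr = np.linspace(-1.0, 1.0, n_train)   # equispaced training inputs (assumption)
    x_te = np.linspace(-1.0, 1.0, n_test)
    y_tr = f(x_tr) + noise_sd * rng.normal(size=n_train)
    y_te = f(x_te)                           # noiseless targets for the test error
    train_mse, test_mse = [], []
    for d in degrees:
        # Legendre design matrices with d + 1 columns
        Phi_tr, Phi_te = legvander(x_tr, d), legvander(x_te, d)
        w = min_norm_fit(Phi_tr, y_tr)
        train_mse.append(np.mean((Phi_tr @ w - y_tr) ** 2))
        test_mse.append(np.mean((Phi_te @ w - y_te) ** 2))
    return np.array(train_mse), np.array(test_mse)


# Degree 19 gives 20 coefficients, matching the 20 observations.
degrees = [1, 3, 5, 10, 15, 19, 25, 40, 60]
train_mse, test_mse = risk_curve(degrees)
for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree={d:3d}  train MSE={tr:.3e}  test MSE={te:.3e}")
```

Once the degree reaches the interpolation threshold the training error drops to numerical zero, while the shape of the test-error curve, including whether a second descent is visible, depends on the basis, the noise level, and the random seed.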

Paper Structure

This paper contains 30 sections, 1 theorem, 94 equations, 5 figures.

Key Result

Theorem 3.1

To compute $P(m\mid y)$ using the Dickey-Savage density ratio, we also need to calculate $\hat{f}_m(x) = E\left[f(x,\theta_m) \mid \theta_{M-m} = 0,\, y,\, M\right]$.
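
For orientation, the textbook form of the Dickey-Savage (also written Savage-Dickey) density ratio for a sub-model $m$ nested in the full model $M$ is sketched here. The notation $\theta_{M-m}$ for the parameters fixed at zero follows the theorem statement; the prior-compatibility condition is the usual textbook assumption and is not taken from the paper's own hypotheses.

$$
\frac{p(y \mid m)}{p(y \mid M)} \;=\; \frac{p(\theta_{M-m} = 0 \mid y, M)}{p(\theta_{M-m} = 0 \mid M)},
\qquad \text{provided } p(\theta_m \mid \theta_{M-m} = 0, M) = p(\theta_m \mid m).
$$

$$
P(m \mid y) \;=\; \frac{P(m)\, p(y \mid m)}{\sum_{m'} P(m')\, p(y \mid m')}.
$$

Under this condition, the evidence for each nested model can be read off from posterior and prior densities of the full model evaluated at $\theta_{M-m} = 0$, so only the full model $M$ has to be fitted.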

Figures (5)

  • Figure 1: Stylized double descent curve showing the classical bias-variance trade-off region (left) and the re-descending risk in the over-parameterized regime (right). The interpolation threshold occurs when the number of parameters equals the number of observations.
  • Figure 2: Double Descent Phenomenon: Polynomial Regression with Different Degrees
  • Figure 3: Bias-Variance Trade-off: Training and Test MSE vs Model Complexity
  • Figure 4: Data generated from the model $y_i = f(x_i, \boldsymbol{\theta}) + \epsilon_i$, where $\epsilon_i \sim N(0,0.3^2)$. The red dots represent the true function $f(x, \boldsymbol{\theta})$, and the blue line shows the noisy observations.
  • Figure 5: Marginal likelihood for a model with true polynomial degree $p_{\text{true}}=10$ (example) and $N=20$ observations. The x-axis represents the assumed model complexity $p$ (degree of polynomial fit), and the y-axis shows the log marginal likelihood. The peak indicates the optimal model complexity as selected by the marginal likelihood.
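
The marginal-likelihood curve described in the Figure 5 caption can be reproduced in spirit for a conjugate Gaussian model, where the evidence is available in closed form. This is a minimal sketch under stated assumptions: a monomial basis, an isotropic Gaussian prior $w \sim N(0, \tau^2 I)$ on the coefficients, a known noise standard deviation of 0.3 (as in the Figure 4 caption), and a hypothetical cubic true function chosen only for the demo. The function name `log_marginal_likelihood` and its parameters are illustrative, not the authors' code.

```python
import numpy as np


def log_marginal_likelihood(x, y, degree, noise_sd=0.3, prior_sd=1.0):
    """Log evidence log p(y | degree) for y = Phi w + eps,
    with w ~ N(0, prior_sd^2 I) and eps ~ N(0, noise_sd^2 I)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.size
    Phi = np.vander(x, degree + 1, increasing=True)   # columns 1, x, ..., x^degree
    # Integrating out w gives y ~ N(0, C) with C = noise_sd^2 I + prior_sd^2 Phi Phi^T
    C = noise_sd**2 * np.eye(n) + prior_sd**2 * (Phi @ Phi.T)
    _, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)


# Demo data: N = 20 observations from a hypothetical cubic (assumption, not the paper's setup).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = x + 3.0 * x**3 + 0.3 * rng.normal(size=x.size)

degrees = np.arange(0, 16)
log_ev = np.array([log_marginal_likelihood(x, y, d, prior_sd=2.0) for d in degrees])
for d, le in zip(degrees, log_ev):
    print(f"degree={d:2d}  log evidence={le:8.2f}")
print("degree with highest evidence:", degrees[np.argmax(log_ev)])
```

The log-determinant term penalizes models that spread prior mass over many possible data sets, while the quadratic term rewards fit; their balance is the automatic Occam's razor effect that the marginal likelihood provides. Which degree the evidence selects depends on the prior scale and noise level assumed here.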

Theorems & Definitions (3)

  • Theorem 3.1: Model Nesting and Computational Equivalence
  • proof
  • proof