Bayesian Double Descent
Nick Polson, Vadim Sokolov
TL;DR
This work provides a coherent Bayesian account of the double descent phenomenon, showing that the re-descending risk with increasing model complexity can be explained by the prior on parameters given the model and the marginal likelihood over models. It unifies Occam's razor with over-parameterization through Bayesian model selection, the Dickey-Savage density ratio, and global-local shrinkage priors, and demonstrates how these ideas extend to polynomial regression and neural network regression. The paper develops a principled framework linking evidence, priors, and shrinkage (e.g., generalized ridge regression) to explain when and why over-parameterized models can generalize well, and it discusses practical applications and computational considerations for Bayesian interpolation and NN regression. The findings have implications for model selection, hyperparameter regularization, and the interpretation of generalization in modern, high-capacity learning systems, with avenues for future research in adaptive priors and scalable inference in deep models.
Abstract
Double descent is a phenomenon of over-parameterized statistical models, such as deep neural networks, whose risk function has a re-descending property. As the complexity of the model increases, the risk first exhibits a U-shaped region due to the traditional bias-variance trade-off; then, when the number of parameters equals the number of observations, the model becomes an interpolator and the risk can be unbounded; finally, in the over-parameterized region, the risk re-descends: the double descent effect. Our goal is to show that this has a natural Bayesian interpretation. We also show that it does not conflict with the traditional Occam's razor: simpler models are preferred to complex ones, all else being equal. Our theoretical foundations use Bayesian model selection and the Dickey-Savage density ratio, and connect generalized ridge regression and global-local shrinkage methods with double descent. We illustrate our approach for high-dimensional neural networks and provide detailed treatments of infinite Gaussian means models and non-parametric regression. Finally, we conclude with directions for future research.
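The risk spike at the interpolation threshold and the re-descent in the over-parameterized region can be reproduced in a small simulation. The sketch below is an illustration only and is not the construction used in the paper: it fits minimum-norm least squares (the ridgeless limit of ridge regression, which is the posterior mean under a zero-mean Gaussian prior) to the first p of D isotropic Gaussian features. The function name `min_norm_risk` and the settings `n`, `D`, `sigma`, and `trials` are assumptions chosen for the example.

```python
import numpy as np

def min_norm_risk(n=40, D=120, sigma=0.5, trials=20, seed=0):
    """Average excess risk of minimum-norm least squares using the first p features.

    Illustrative settings (assumptions): n observations, D candidate features,
    noise standard deviation sigma, risk averaged over `trials` data sets.
    """
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=D)
    beta /= np.linalg.norm(beta)              # true coefficients with unit norm
    sizes = np.arange(1, D + 1)               # model sizes p = 1, ..., D
    risk = np.zeros(D)
    for _ in range(trials):
        X = rng.normal(size=(n, D))           # isotropic Gaussian design
        y = X @ beta + sigma * rng.normal(size=n)
        for p in sizes:
            b = np.zeros(D)
            # pinv gives ordinary least squares for p < n and the
            # minimum-norm interpolating solution for p >= n
            b[:p] = np.linalg.pinv(X[:, :p]) @ y
            # with identity feature covariance, the excess prediction risk
            # equals the squared coefficient error
            risk[p - 1] += np.sum((b - beta) ** 2)
    return sizes, risk / trials

sizes, risk = min_norm_risk()
for p in (10, 40, 120):
    print(f"p = {p:3d}   excess risk = {risk[p - 1]:.2f}")
```

Under these assumptions the printed risk at p = n = 40 should be far larger than at p = 10 or p = 120, which is the peak-then-re-descent pattern described above. Replacing the pseudo-inverse with a ridge estimator at a positive penalty would damp the peak, which is one way to see the link between shrinkage priors and double descent.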
