Revisiting Optimism and Model Complexity in the Wake of Overparameterized Machine Learning

Pratik Patil; Jin-Hong Du; Ryan J. Tibshirani

TL;DR

This work rethinks model complexity under overparameterization by introducing random-X degrees of freedom, bridging classical fixed-X ideas with modern predictive settings. It defines two random-X df notions—intrinsic and emergent—via random-X optimism and matches them to least-squares references to quantify complexity for arbitrary predictors, including interpolators. The authors develop theory for ridge, ridgeless, lasso, and convex regularized estimators, derive asymptotic equivalents under proportional regimes, and validate via extensive experiments across regression families and distribution shifts. A key insight is that emergent df typically exceeds intrinsic df, reflecting bias contributions, and that random-X df map to random-X prediction error through a universal mapping function. The framework enables decomposition of df into components due to bias and covariate shift, offering a tractable lens to study generalization in high-dimensional, non-smooth, and interpolating models with practical estimators like ridge, lasso, and random forests.

Abstract

Common practice in modern machine learning involves fitting a large number of parameters relative to the number of observations. These overparameterized models can exhibit surprising generalization behavior, e.g., "double descent" in the prediction error curve when plotted against the raw number of model parameters, or another simplistic notion of complexity. In this paper, we revisit model complexity from first principles, by first reinterpreting and then extending the classical statistical concept of (effective) degrees of freedom. Whereas the classical definition is connected to fixed-X prediction error (in which prediction error is defined by averaging over the same, nonrandom covariate points as those used during training), our extension of degrees of freedom is connected to random-X prediction error (in which prediction error is averaged over a new, random sample from the covariate distribution). The random-X setting more naturally embodies modern machine learning problems, where highly complex models, even those complex enough to interpolate the training data, can still lead to desirable generalization performance under appropriate conditions. We demonstrate the utility of our proposed complexity measures through a mix of conceptual arguments, theory, and experiments, and illustrate how they can be used to interpret and compare arbitrary prediction models.
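The distinction between fixed-X and random-X prediction error can be illustrated with a small simulation. The following is a minimal sketch for ordinary least squares, not the paper's experimental setup: the data model (isotropic Gaussian features, a linear mean with a hypothetical coefficient vector `beta`) and the values of `n`, `d`, and `sigma` are assumptions chosen for illustration.

```python
import numpy as np


def least_squares_errors(n=50, d=10, sigma=1.0, reps=2000, seed=0):
    """Monte Carlo estimates of training, fixed-X, and random-X error
    for least squares with d < n isotropic Gaussian features."""
    rng = np.random.default_rng(seed)
    beta = np.ones(d) / np.sqrt(d)  # hypothetical true coefficients
    train, fixed, rand = [], [], []
    for _ in range(reps):
        X = rng.standard_normal((n, d))
        y = X @ beta + sigma * rng.standard_normal(n)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

        # Training error: same covariates, same responses.
        train.append(np.mean((y - X @ beta_hat) ** 2))

        # Fixed-X error: same covariate points, fresh responses.
        y_new = X @ beta + sigma * rng.standard_normal(n)
        fixed.append(np.mean((y_new - X @ beta_hat) ** 2))

        # Random-X error: fresh covariates and fresh responses.
        X0 = rng.standard_normal((n, d))
        y0 = X0 @ beta + sigma * rng.standard_normal(n)
        rand.append(np.mean((y0 - X0 @ beta_hat) ** 2))
    return float(np.mean(train)), float(np.mean(fixed)), float(np.mean(rand))


if __name__ == "__main__":
    tr, fx, rx = least_squares_errors()
    print("training error:", tr)
    print("fixed-X error: ", fx, " optimism:", fx - tr)
    print("random-X error:", rx, " optimism:", rx - tr)
```

Under these assumptions, standard results give expected training error $\sigma^2(1 - d/n)$, fixed-X error $\sigma^2(1 + d/n)$, and random-X error $\sigma^2(1 + d/(n-d-1))$. The fixed-X optimism is then $2\sigma^2 d/n$, which recovers the classical $d$ degrees of freedom after multiplying by $n/(2\sigma^2)$, while the random-X optimism is strictly larger.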

Paper Structure (73 sections, 16 theorems, 183 equations, 10 figures, 1 table)

Introduction
Summary and outline
New random-X measures of degrees of freedom.
Basic properties and theory for random-X degrees of freedom.
Numerical experiments for a diverse set of prediction models.
Decomposing degrees of freedom under distribution shift.
Related work
Model optimism and degrees of freedom.
Other complexity measures.
Preliminaries
Fixed-X and random-X prediction error
Fixed-X optimism and degrees of freedom
Limitations of classical degrees of freedom
Random-X degrees of freedom
Reinterpreting fixed-X degrees of freedom
...and 58 more sections

Key Result

Proposition 1

For each fixed $d \leq n$, let $\widetilde{X}_d \in \mathbb{R}^{n \times d}$ be an arbitrary feature matrix having linearly independent columns, and consider $\widehat{f}^{\mathrm{ls}}(\cdot; \widetilde{X}_d, y)$, the predictor from least squares regression of $y$ on $\widetilde{X}_d$, which we call [...] Let us extend these reference values so that we may write, for all nonnegative $d$, [...] Given an arbitr[...]

Figures (10)

Figure 1: An illustration using ridgeless least squares regression as the prediction model, trained on $n=100$ samples and $p$ features, where $p$ ranges from 1 to 300. The true conditional mean is a nonlinear function in the features, and hence adding more features to the working linear model helps its approximation capacity (the precise details are given in \ref{app:data-models}). In the left panel, we can see that the random-X prediction error curve exhibits "double descent" in $p$. In the middle panel, the classical (fixed-X) definition of degrees of freedom increases linearly for $p \leq n$, but then it flattens out at the trivial answer of $n$ degrees of freedom for all $p > n$. The "intrinsic" random-X degrees of freedom, one of two basic versions of random-X degrees of freedom to be defined later in \ref{sec:proposal}, is decreasing when $p > n$, indicating that the ridgeless interpolator is becoming less complex as the dimensionality grows. In the right panel, we plot the random-X prediction error as a function of random-X degrees of freedom. The interpretation: our proposed complexity measure maps every overparameterized model onto an equivalent underparameterized model, and the best-predicting model (which lies in the overparameterized regime) actually has relatively low complexity.
Figure 3: Plot of $\omega$ in \ref{eq:omega}, which maps from normalized optimism (optimism divided by $\sigma^2$) to normalized degrees of freedom (degrees of freedom divided by $n-1$).
Figure 4: Degrees of freedom of lasso predictors, parameterized by the average number of nonzero coefficients, in a problem setting with $n=200$, $p=30$, and sparsity level $s=10$.
Figure 6: Prediction error and degrees of freedom of random forest predictors, as we vary the number of trees $N_{\text{tree}}$ and the maximum number of leaves for each tree $N_{\text{leaf}}^{\max}$, in a problem with $n=2000$, $p=50$.
Figure 7: Prediction error and degrees of freedom for ridge regression, kNN, and random forests. In both rows, $n=200$ and $p=100$. The top row displays data drawn from a linear model, which favors ridge. The bottom displays data drawn from a model that favors random forests.
...and 5 more figures
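
The qualitative behavior described in the Figure 1 caption (interpolation for $p > n$, and a random-X error peak near $p = n$) can be reproduced with a toy simulation. The sketch below is not the paper's data model, which uses a nonlinear conditional mean specified in its appendix. Here the response is assumed to be linear in `p_total` isotropic Gaussian features, of which the working model sees only the first `p`; all names and parameter values are hypothetical.

```python
import numpy as np


def ridgeless_curve(n=100, p_total=300, p_grid=(20, 100, 300),
                    sigma=0.5, reps=20, seed=0):
    """Average training and random-X error of min-norm least squares
    when only the first p of p_total features are used."""
    rng = np.random.default_rng(seed)
    beta = np.ones(p_total) / np.sqrt(p_total)  # hypothetical true signal
    out = {}
    for p in p_grid:
        train_err, test_err = [], []
        for _ in range(reps):
            X = rng.standard_normal((n, p_total))
            y = X @ beta + sigma * rng.standard_normal(n)
            Xp = X[:, :p]

            # Ridgeless fit: the pseudoinverse gives the least squares
            # solution for p <= n and the min-norm interpolator for p > n.
            b = np.linalg.pinv(Xp) @ y
            train_err.append(np.mean((y - Xp @ b) ** 2))

            # For isotropic features, the random-X error has the closed
            # form sigma^2 + ||b_full - beta||^2, so no test sample is drawn.
            b_full = np.zeros(p_total)
            b_full[:p] = b
            test_err.append(sigma ** 2 + np.sum((b_full - beta) ** 2))
        out[p] = (float(np.mean(train_err)), float(np.mean(test_err)))
    return out


if __name__ == "__main__":
    for p, (tr, te) in ridgeless_curve().items():
        print(f"p={p:3d}  training error={tr:.3g}  random-X error={te:.3g}")
```

In this toy setting the training error is numerically zero at $p = 300$, so training error alone cannot distinguish interpolators from one another, while the random-X error spikes near $p = n$ and falls again beyond it.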

Theorems & Definitions (21)

Proposition 1
proof
Definition 1
Definition 2
Theorem 2
proof
Proposition 3
Proposition 4
Proposition 5
Theorem 6
...and 11 more
