
The Interpolating Information Criterion for Overparameterized Models

Liam Hodgkinson, Chris van der Heide, Robert Salomone, Fred Roosta, Michael W. Mahoney

TL;DR

This work addresses model selection in overparameterized settings where interpolating estimators exist and traditional information criteria fail. It develops the Interpolating Information Criterion (IIC) by establishing a Bayesian duality between over- and underparameterized representations and applying a Laplace-type analysis on the interpolating manifold via a coarea-formula framework. The IIC combines a regularization term that reflects prior misspecification, a sharpness term tied to the Jacobian, and a curvature term comparing ambient and manifold curvature, plus a data-size correction, and specializes to closed forms in linear regression. Empirical results across linear, gamma, polynomial, and diagonal neural-network models show the IIC correlates with predictive losses and helps explain double-descent phenomena, providing a principled, prior-aware, non-asymptotic model selection tool for interpolating regimes.

Abstract

The problem of model selection is considered for the setting of interpolating estimators, where the number of model parameters exceeds the size of the dataset. Classical information criteria typically consider the large-data limit, penalizing model size. However, these criteria are not appropriate in modern settings where overparameterized models tend to perform well. For any overparameterized model, we show that there exists a dual underparameterized model that possesses the same marginal likelihood, thus establishing a form of Bayesian duality. This enables more classical methods to be used in the overparameterized setting, revealing the Interpolating Information Criterion, a measure of model quality that naturally incorporates the choice of prior into the model selection. Our new information criterion accounts for prior misspecification, geometric and spectral properties of the model, and is numerically consistent with known empirical and theoretical behavior in this regime.

Paper Structure

This paper contains 18 sections, 17 theorems, 79 equations, 5 figures, and 1 table.

Key Result

Lemma 1

Assume $R$ is bounded from below on $\Theta$. Any limit of a sequence of solutions $\theta_\gamma$ to eq:MAP as $\gamma \to 0^+$ is a solution to eq:INT.
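Lemma 1 can be illustrated numerically in the simplest linear case. The sketch below rests on assumptions that are not taken from the source, which only refers to the labels eq:MAP and eq:INT: the regularized problem is taken to be ridge regression, minimizing $\|X\theta - y\|^2 + \gamma\|\theta\|^2$ (so $R(\theta) = \|\theta\|^2$), and the interpolation problem is taken to be the minimum-norm solution of $X\theta = y$. The data, dimensions, and function names are hypothetical.

```python
import numpy as np

def ridge_solution(X, y, gamma):
    """Minimizer of ||X @ theta - y||^2 + gamma * ||theta||^2.

    Uses the identity (X^T X + gamma I)^{-1} X^T y = X^T (X X^T + gamma I)^{-1} y,
    which only needs an n-by-n solve when n < d.
    """
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + gamma * np.eye(n), y)

def min_norm_interpolator(X, y):
    """Minimum Euclidean norm solution of X @ theta = y (assumed form of eq:INT)."""
    return np.linalg.pinv(X) @ y

# Hypothetical overparameterized problem: more parameters (d) than data points (n).
rng = np.random.default_rng(0)
n, d = 20, 100
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

theta_int = min_norm_interpolator(X, y)

# Distance between the regularized solution and the interpolator as gamma -> 0+.
gammas = [1.0, 1e-2, 1e-4, 1e-6]
gaps = [np.linalg.norm(ridge_solution(X, y, g) - theta_int) for g in gammas]

for g, gap in zip(gammas, gaps):
    print(f"gamma={g:g}  ||theta_gamma - theta_int|| = {gap:.3e}")
```

In this special case the gap shrinks monotonically with $\gamma$, which is consistent with the limiting statement in the lemma; the lemma itself covers general $R$ bounded from below, which this sketch does not attempt to verify.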

Figures (5)

  • Figure 1: Visualizing the three primary terms of the IIC. (Left) The regularization term measures closeness of $\theta_0$ to the interpolating manifold. (Center) The sharpness term favors flatter over sharper minima in the loss, as this suggests the region of small training loss (blue) overlaps more regions of small test loss (red); see keskar2016large for a similar visualization in one dimension. (Right) The curvature term penalizes regions where the vector normal to the prior is less stable than around its global minima, as this suggests more of the neighboring region along the manifold $\mathcal{M}$ (grey) falls outside the region of high prior probability (green).
  • Figure 2: Mean squared error (MSE; top) vs. (bottom) classical BIC (\ref{eq:BICLinearReg}), our novel IIC (\ref{eq:IICLinearReg}), and the BIC for ridge regression with ridge parameter $\lambda=0.1$ (BIC-$\lambda$), for random Fourier features with varying number of attributes. Measures are averaged over $100$ iterations applied to random subsamples of $n=1000$ input-output pairs from the MNIST dataset lecun1998gradient. The underparameterized, critical (where BIC fails), and overparameterized regimes are highlighted blue, red, and yellow, respectively. Excluding the critical region, the combined BIC and IIC curve exhibits double descent.
  • Figure 3: IIC (left) and log test negative log-likelihood (right) for gamma regression, averaged over 50,000 replications on synthetic data. The IIC strongly correlates with the negative log-likelihood (Spearman correlation across 256 uniformly sampled points is $r = 0.996$).
  • Figure 4: Mean squared error (MSE; black) vs. classical BIC (\ref{eq:BICLinearReg}; blue), and our novel IIC (\ref{eq:IICLinearReg}; orange), for polynomial regression with increasing degree fitted to Runge's function $f(x) = (1 + 25 x^2)^{-1}$ over 10 equally spaced nodes on $[-1,1]$. Polynomial regression does not exhibit double descent.
  • Figure 5: IIC (left), MSE (center), and an adjusted MSE loss involving the initialisation parameter $\alpha$ (right) for diagonal linear neural networks. The IIC strongly correlates with the adjusted MSE$_\alpha$.
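The polynomial regression setup described for Figure 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' experiment: the Legendre basis, the minimum-norm least-squares fit via `np.linalg.lstsq`, the degree range, and the dense test grid are all choices made here for illustration, and the IIC and BIC curves are not computed because their formulas are not reproduced in this summary.

```python
import numpy as np
from numpy.polynomial import legendre

def runge(x):
    """Runge's function f(x) = (1 + 25 x^2)^{-1}."""
    return 1.0 / (1.0 + 25.0 * x**2)

def fit_min_norm_poly(x_train, y_train, degree):
    """Least-squares fit in a Legendre basis (basis choice is an assumption).

    When degree + 1 exceeds the number of nodes, lstsq returns the
    minimum-norm coefficient vector among all interpolating solutions.
    """
    V = legendre.legvander(x_train, degree)
    coef, *_ = np.linalg.lstsq(V, y_train, rcond=None)
    return coef

# 10 equally spaced nodes on [-1, 1], as in the figure caption.
x_train = np.linspace(-1.0, 1.0, 10)
y_train = runge(x_train)

# Dense grid used as a stand-in for test data (an assumption of this sketch).
x_test = np.linspace(-1.0, 1.0, 1001)
y_test = runge(x_test)

results = {}
for degree in range(1, 31):
    coef = fit_min_norm_poly(x_train, y_train, degree)
    train_mse = np.mean((legendre.legval(x_train, coef) - y_train) ** 2)
    test_mse = np.mean((legendre.legval(x_test, coef) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)

for degree, (train_mse, test_mse) in results.items():
    print(f"degree={degree:2d}  train MSE={train_mse:.3e}  test MSE={test_mse:.3e}")
```

From degree 9 onward the model has at least as many coefficients as nodes, so the fit interpolates the training data and the training MSE is numerically zero. The printed test MSE can then be inspected across degrees and compared with the caption's statement about double descent.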

Theorems & Definitions (33)

  • Definition 1
  • Lemma 1
  • proof
  • Lemma 2: Augmented Lagrangian Duality rockafellar2009variational
  • Remark: Generalized Linear Models
  • Proposition 1: Bayesian Duality
  • proof: Proof of Proposition \ref{thm:DR}
  • Proposition 2: Smoothness of the Dual Prior
  • proof
  • Lemma 3
  • ...and 23 more