More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory

James B. Simon, Dhruva Karkada, Nikhil Ghosh, Mikhail Belkin

TL;DR

This work provides a theoretical framework showing that, in random feature (RF) regression, both more data and more features reduce test error when the ridge parameter is optimally tuned, implying that infinite-width RF models are preferable to any finite-width one. By deriving an omniscient risk estimate and invoking a Gaussian universality ansatz, the authors connect RF regression to kernel ridge regression (KRR) and show that, for tasks with powerlaw eigenstructure, overfitting can be obligatory: near-optimal test performance can only be achieved when the training error is much smaller than the test error. They validate the theory with experiments on synthetic data and on real vision tasks using convolutional neural tangent kernels (CNTKs), finding that these tasks exhibit powerlaw spectra and that the predicted optimal degree of fitting aligns with observed performance. The results offer a coherent account of why overparameterization, overfitting, and more data can be beneficial in random feature models, mirroring trends seen in modern deep learning, and they point to a task-dependent regime where interpolation is not merely tolerated but required for optimal generalization.

Abstract

In our era of enormous neural networks, empirical progress has been driven by the philosophy that more is better. Recent deep learning practice has found repeatedly that larger model size, more data, and more computation (resulting in lower training loss) improve performance. In this paper, we give theoretical backing to these empirical observations by showing that these three properties hold in random feature (RF) regression, a class of models equivalent to shallow networks with only the last layer trained. Concretely, we first show that the test risk of RF regression decreases monotonically with both the number of features and the number of samples, provided the ridge penalty is tuned optimally. In particular, this implies that infinite-width RF architectures are preferable to those of any finite width. We then proceed to demonstrate that, for a large class of tasks characterized by powerlaw eigenstructure, training to near-zero training loss is obligatory: near-optimal performance can only be achieved when the training error is much smaller than the test error. Grounding our theory in real-world data, we find empirically that standard computer vision tasks with convolutional neural tangent kernels clearly fall into this class. Taken together, our results tell a simple, testable story of the benefits of overparameterization, overfitting, and more data in random feature models.
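
The first claim of the abstract can be probed numerically. The following is a minimal sketch, not the authors' experimental code: it fits ReLU random-feature ridge regression on a hypothetical synthetic task (noisy linear target, Gaussian inputs) and, for each width $k$, reports the smallest test error over a grid of ridge values. The function name `rf_test_errors`, the task, the feature scaling, and the ridge grid are all assumptions made for illustration. Picking the ridge by test error is an "omniscient" choice used only to mimic optimal tuning.

```python
import numpy as np

def rf_test_errors(n, k, ridges, d=10, n_test=1000, noise=0.1, seed=0):
    """Test MSE of ReLU random-feature ridge regression, one value per ridge."""
    rng = np.random.default_rng(seed)
    # Hypothetical synthetic task: noisy linear target on Gaussian inputs.
    w_star = rng.standard_normal(d) / np.sqrt(d)
    X_tr = rng.standard_normal((n, d))
    X_te = rng.standard_normal((n_test, d))
    y_tr = X_tr @ w_star + noise * rng.standard_normal(n)
    y_te = X_te @ w_star
    # Random first layer, kept frozen; only the linear readout is fit.
    W = rng.standard_normal((d, k)) / np.sqrt(d)
    F_tr = np.maximum(X_tr @ W, 0.0) / np.sqrt(k)
    F_te = np.maximum(X_te @ W, 0.0) / np.sqrt(k)
    K = F_tr @ F_tr.T       # n x n Gram matrix of the random features
    K_te = F_te @ F_tr.T    # test-vs-train Gram matrix
    errs = []
    for ridge in ridges:
        # Dual (kernel) form of ridge regression on the random features.
        coef = np.linalg.solve(K + ridge * np.eye(n), y_tr)
        errs.append(np.mean((K_te @ coef - y_te) ** 2))
    return np.array(errs)

ridges = np.logspace(-6, 2, 17)   # assumed ridge grid
widths = (4, 32, 256, 1024)       # number of random features k
best = {k: float(rf_test_errors(256, k, ridges).min()) for k in widths}
for k in widths:
    print(f"k={k:5d}  best test MSE over ridge grid = {best[k]:.4f}")
```

A single seed gives a noisy picture; averaging over several seeds before taking the minimum over the ridge grid would be closer in spirit to a statement about expected test risk.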

Paper Structure (42 sections, 19 theorems, 79 equations, 9 figures)

Key Result

Theorem 1

Let $\mathcal{E}_\textnormal{te}(n,k,\delta)$ denote $\mathcal{E}_\textnormal{te}$ with $n$ samples, $k$ features, and ridge $\delta$ with any task eigenstructure $\{\lambda_i\}_{i=1}^\infty, \{v_i\}_{i=1}^\infty$. Let $n' \ge n \ge 0$ and $k' \ge k \ge 0$. It holds that $\min_{\delta'} \mathcal{E}_\textnormal{te}(n',k',\delta') \le \min_{\delta} \mathcal{E}_\textnormal{te}(n,k,\delta)$, with strict inequality so long as $(n,k) \neq (n',k')$ and $\sum_i \lambda_i v_i^2 > 0$ (i.e., the target has nonzero learnable

Figures (9)

  • Figure 1: At optimal ridge, more features monotonically improve test performance. Train and test errors for RF regression closely match the predictions of \ref{eqn:rf_test_risk} and \ref{eqn:rf_train_risk} for both synthetic Gaussian data and CIFAR10 with random ReLU features. Plots show traces with $n=256$ samples and a varying number of features $k$. See \ref{app:rf_verification} for experimental details and more plots.
  • Figure 2: Overfitting is obligatory in standard computer vision tasks. We run KRR with convolutional NTKs on three tasks using varying ridge parameter and label noise, measuring test error and the fitting ratio $\mathcal{R}_{\textnormal{tr}/\textnormal{te}}$. We then compare to theoretical predictions (cf. \ref{lemma:risk_in_terms_of_ratio}) computed from measured powerlaw exponents $\hat{\alpha},\;\hat{\beta}$. When no noise is added, we observe that the optimal fitting ratio is $\mathcal{R}_{\textnormal{tr}/\textnormal{te}}^* \approx 0$ (blue vertical dotted line) and (near-)interpolation is required to achieve optimal error. These tasks have low intrinsic noise, and as label noise is added, $\mathcal{R}_{\textnormal{tr}/\textnormal{te}}^*$ becomes nonzero, as predicted by \ref{cor:interpolation_optimality_condition}. Curves with noise added are rescaled to preserve total task power. See \ref{app:powerlaw_measurements} for exponent measurements and \ref{app:powerlaw_verification} for full experimental details.
  • Figure 3: Empirical verification of the RF eigenframework. We plot various traces of train and test error, both experimental and theoretical as predicted by \ref{eqn:rf_test_risk} and \ref{eqn:rf_train_risk}, for two random feature models. (top row, same as \ref{fig:rf_regression_theory_matches_exp}) We fix the training set size $n=256$ and vary the number of features $k$. (bottom row) We fix the number of random features $k=256$ and vary the training set size $n$. Note that in this row, the classical underparametrized regime is to the right of the interpolation threshold, and the modern overparametrized regime is to the left.
  • Figure 4: For RF regression with synthetic data, we show heatmaps of average train and test MSE as a function of training set size $n$ and number of random features $k$. We vary the ridge parameter $\delta$ from underregularized (left column) to overregularized (right column). In the underregularized setting, the signature double descent peak (bright diagonal) separates the classical regime (upper triangle) from the modern interpolating regime (lower triangle). In the overregularized setting, the model fails to interpolate the training data even at low $n$. Our theory accurately captures these phenomena. Note: at each $n$, we use the same batch of random datasets for all $k$, resulting in horizontal stripes visible at low $n$ that may be ignored as artifacts.
  • Figure 5: For the four vision tasks studied (columns), we show our techniques for measuring $\alpha$ (first row) and $\beta$ (third row). We fit the powerlaw decay in the tails (solid line) and report the corresponding exponent measurements (text). These plots are generated by studying increasingly large $n\times n$ training-data kernel matrices. For visual comparison, we include the empirical eigenstructure (eigenvalues and squared eigencoefficients of the full training-data kernel matrix, second and fourth rows respectively), along with a powerlaw decay with our measured exponent (solid line). Note that linear fits to the empirical eigenstructure (rows two and four) would be worse than to our proxy measurements (rows one and three).
  • ...and 4 more figures
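
The fitting ratio measured in Figure 2 can be illustrated on synthetic data. The sketch below is an assumption-laden toy, not the paper's CNTK experiment: it reads $\mathcal{R}_{\textnormal{tr}/\textnormal{te}}$ as train MSE divided by test MSE, builds a Gaussian-feature task whose kernel eigenvalues and squared target coefficients decay as powerlaws with hypothetical exponents `alpha` and `beta`, and sweeps the ridge in kernel ridge regression. All variable names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n, n_test = 2000, 200, 1000           # number of eigenmodes, train size, test size
alpha, beta = 1.5, 1.5                   # hypothetical powerlaw exponents
i = np.arange(1, M + 1, dtype=float)
lam = i ** -alpha                        # kernel eigenvalues, lambda_i = i^(-alpha)
v = i ** (-beta / 2)                     # target coefficients, v_i^2 = i^(-beta)
v /= np.linalg.norm(v)                   # normalize to unit total task power

# Rows hold unit-variance Gaussian "eigenfunction" evaluations for each sample.
Z_tr = rng.standard_normal((n, M))
Z_te = rng.standard_normal((n_test, M))
y_tr, y_te = Z_tr @ v, Z_te @ v          # noiseless targets
K = (Z_tr * lam) @ Z_tr.T                # train kernel matrix
K_te = (Z_te * lam) @ Z_tr.T             # test-vs-train kernel matrix

ridges = np.logspace(-6, 2, 25)          # assumed ridge grid
train_mse, test_mse = [], []
for ridge in ridges:
    coef = np.linalg.solve(K + ridge * np.eye(n), y_tr)
    train_mse.append(np.mean((K @ coef - y_tr) ** 2))
    test_mse.append(np.mean((K_te @ coef - y_te) ** 2))
train_mse, test_mse = np.array(train_mse), np.array(test_mse)

fitting_ratio = train_mse / test_mse     # assumed reading of R_tr/te
best = int(np.argmin(test_mse))
print(f"ridge with lowest test MSE: {ridges[best]:.2e}")
print(f"fitting ratio at that ridge: {fitting_ratio[best]:.3e}")
```

Adding label noise to `y_tr` and repeating the sweep gives a quick way to see how the ratio at the best ridge responds to noise in this toy setting.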

Theorems & Definitions (21)

  • Theorem 1: More is better for RF regression
  • Definition 1: Optimal ridge, test error, and fitting ratio
  • Definition 2: $\alpha,\beta$ powerlaw eigenstructure
  • Theorem 2
  • Corollary 1
  • Corollary 2
  • Proposition 1: Derivatives of implicit constants defined in \ref{eqn:kappa_gamma_defs}
  • Proposition 2: Monotonic improvement after the double-descent peak
  • Lemma 1: Continuum approximations to eigensums
  • Lemma 2: Zero-ridge implicit regularization
  • ...and 11 more
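
Definition 2 is listed here by title only. Assuming it refers to eigenvalues decaying as $\lambda_i \propto i^{-\alpha}$ and squared eigencoefficients as $v_i^2 \propto i^{-\beta}$, an exponent can be estimated from a log-log fit over the tail of a sequence. The sketch below shows that naive direct fit on synthetic sequences. It is not the proxy measurement described in the Figure 5 caption, which notes that direct fits to the empirical eigenstructure would be worse. The helper name `tail_powerlaw_exponent` and the index range are hypothetical.

```python
import numpy as np

def tail_powerlaw_exponent(values, i_min, i_max):
    """Estimate a in values[i-1] ~ C * i^(-a) from a log-log least-squares fit
    over the 1-based index range i_min..i_max (inclusive)."""
    idx = np.arange(i_min, i_max + 1)
    slope, _ = np.polyfit(np.log(idx), np.log(values[i_min - 1:i_max]), 1)
    return -slope

i = np.arange(1, 1001, dtype=float)
lam = i ** -1.5                                   # exact powerlaw, exponent 1.5
alpha_hat = tail_powerlaw_exponent(lam, 10, 1000)

# The same sequence with multiplicative log-normal noise (assumed noise level).
rng = np.random.default_rng(0)
lam_noisy = lam * np.exp(0.1 * rng.standard_normal(lam.size))
alpha_hat_noisy = tail_powerlaw_exponent(lam_noisy, 10, 1000)

print(f"exact sequence: alpha_hat = {alpha_hat:.4f}")
print(f"noisy sequence: alpha_hat = {alpha_hat_noisy:.4f}")
```

The fitted range matters in practice: the head of an empirical spectrum often deviates from the tail behavior, which is why the fit is restricted to indices from `i_min` upward.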