Near-Interpolators: Rapid Norm Growth and the Trade-Off between Interpolation and Generalization

Yutong Wang; Rishi Sonthalia; Wei Hu

TL;DR

For nearly-interpolating linear regressors, larger norm scaling exponents $\alpha$ correspond to worse trade-offs between interpolation and generalization; experiments suggest that a similar phenomenon holds for nearly-interpolating shallow neural networks.

Abstract

We study the generalization capability of nearly-interpolating linear regressors: $\boldsymbol{\beta}$'s whose training error $\tau$ is positive but small, i.e., below the noise floor. Under a random matrix theoretic assumption on the data distribution and an eigendecay assumption on the data covariance matrix $\boldsymbol{\Sigma}$, we demonstrate that any near-interpolator exhibits rapid norm growth: for $\tau$ fixed, $\boldsymbol{\beta}$ has squared $\ell_2$-norm $\mathbb{E}[\|\boldsymbol{\beta}\|_{2}^{2}] = \Omega(n^{\alpha})$ where $n$ is the number of samples and $\alpha>1$ is the exponent of the eigendecay, i.e., $\lambda_i(\boldsymbol{\Sigma}) \sim i^{-\alpha}$. This implies that existing data-independent norm-based bounds are necessarily loose. On the other hand, in the same regime we precisely characterize the asymptotic trade-off between interpolation and generalization. Our characterization reveals that larger norm scaling exponents $\alpha$ correspond to worse trade-offs between interpolation and generalization. We verify empirically that a similar phenomenon holds for nearly-interpolating shallow neural networks.
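The norm-growth claim can be probed numerically with a small simulation. The sketch below is illustrative only and rests on several assumptions that are not taken from the paper: Gaussian features with a diagonal covariance $\lambda_i = i^{-\alpha}$, an aspect ratio $p/n = 2$, a one-hot ground truth, the objective $\frac{1}{n}\|X b - y\|_2^2 + \varrho \|b\|_2^2$, and the hypothetical function name `ridge_near_interpolator`. It uses a ridge regularizer of the form $r n^{-\alpha}$, the scaling that appears in the Key Result.

```python
import numpy as np


def ridge_near_interpolator(n, alpha=2.0, gamma=2.0, r=0.1, sigma=1.0, seed=0):
    """Ridge regression with regularizer r * n**(-alpha) on power-law data.

    Returns (training error, squared l2-norm of the fitted coefficients).
    All defaults are illustrative assumptions, not the paper's settings.
    """
    rng = np.random.default_rng(seed)
    p = int(gamma * n)  # number of features, p/n = gamma
    eigvals = np.arange(1, p + 1, dtype=float) ** (-alpha)  # lambda_i = i^{-alpha}
    X = rng.standard_normal((n, p)) * np.sqrt(eigvals)  # each row is Sigma^{1/2} z
    beta_star = np.zeros(p)
    beta_star[0] = 1.0  # bounded-norm ground truth (assumed)
    y = X @ beta_star + sigma * rng.standard_normal(n)

    rho = r * n ** (-alpha)
    # Dual form of the minimizer of (1/n)||X b - y||^2 + rho ||b||^2:
    # beta_hat = X^T (X X^T + n * rho * I)^{-1} y
    dual = np.linalg.solve(X @ X.T + n * rho * np.eye(n), y)
    beta_hat = X.T @ dual
    train_err = float(np.mean((X @ beta_hat - y) ** 2))
    return train_err, float(beta_hat @ beta_hat)


if __name__ == "__main__":
    for n in (50, 100, 200):
        err, norm_sq = ridge_near_interpolator(n)
        print(f"n={n:4d}  train_err={err:.4f}  ||beta||^2={norm_sq:.2f}")
```

In this setup the training error stays below the noise level $\sigma^2 = 1$ while the squared norm grows with $n$. Comparing the printed norms across several values of $n$ gives a rough sense of the growth rate, but a single seed is not a substitute for the averaged experiments the paper reports.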

Paper Structure (25 sections, 16 theorems, 73 equations, 6 figures)

INTRODUCTION
Related works
Notations
Our contributions
Organization
PRIMER ON RANDOM MATRIX THEORY
INTERPOLATION-GENERALIZATION TRADE-OFF
RAPID NORM GROWTH
$\alpha$-scaled limiting spectral distribution
Looseness of norm-based generalization bounds
EXPERIMENTS
Experiments on synthetic data
Experiments on UCI datasets
ADDITIONAL RELATED WORKS AND NOVELTY OF OUR WORK
DISCUSSION AND LIMITATIONS
...and 10 more sections

Key Result

Theorem 1.5

Let $\tau \in (0,\sigma^{2})$ be arbitrary. Suppose that $\sup_{n=1,2,\dots}\|\bm{\beta}^{\star}\|_{2} < +\infty$, Assumption \ref{assumption:exact-EVD} holds, and $X = \bm{\Sigma}^{1/2} Z$ where $Z \sim \mathcal{N}(0,\mathbf{I}_{p})$. There exists a unique number $k \in \mathbb{R}_{>0}$ such that the following ho[...] and let $\varrho_{n} := r n^{-\alpha}$. Then $\{\hat{\bm{\beta}}_{\varrho_{n}}\}_{n}$ is an asympto[...]

Figures (6)

Figure 1: Left: Synthetic experiments validating the lower bound on the norms of $0.2$-near-interpolators given by Theorem \ref{theorem:polynomial-lower-bound}. The squared norms are fitted by least squares (in log-log space) to estimate the norm-growth exponent $\alpha$ using only data points. See Section \ref{section:experiment} for additional experiment details. Right: Trade-off between the testing and training errors from Theorem \ref{theorem:trade-off}. The solid curves are parametrized curves whose $(x,y)$-coordinates are $(\mathcal{E}_{\mathtt{train}}^{\ast}, \mathcal{E}_{\mathtt{test}}^{\ast})$, parametrized by $k$ (which is in one-to-one correspondence with $r$; see Theorem \ref{theorem:trade-off}). The scatter points, subsampled for visualization, denote the results of ridge regression runs on the HDA model (Example \ref{example:HDA}). The colored ribbons denote the 20-80 quantiles of the scatter points. The horizontal dotted line denotes the noise level $\sigma^2$, which is set to $1$ without loss of generality.
Figure 2: A larger power-law spectrum exponent implies a larger asymptotic excess test error when interpolating to $5\%$ of the noise floor compared to $50\%$ of the noise floor. Let $\mathcal{E}^{\ast}_{\mathtt{test}}=\mathcal{E}^{\ast}_{\mathtt{test}}(\alpha, \gamma_\ast, \tau)$ be as in Theorem \ref{theorem:trade-off}, where we make the dependency on the parameters $\alpha, \gamma_\ast, \tau$ explicit. The colors and contour lines of the plot show the ratio of test errors at two levels of nearness of interpolation, $\mathcal{E}^{\ast}_{\mathtt{test}}(\alpha, \gamma_\ast, 0.05) /\mathcal{E}^{\ast}_{\mathtt{test}}(\alpha, \gamma_\ast, 0.5)$, over an $(\alpha,\gamma_\ast)$-grid.
Figure 3: The $\mathcal{R}(k)$ function from Proposition \ref{proposition:trade-off}. The $x$-axis is the input $k$. Note that for $k < k_{\mathtt{crit}}$ the regularizer $r$ is negative. Although we are only interested in the $(k_{\mathtt{crit}}, +\infty)$ portion, negative regularizers have been studied by \cite{tsigler2020benign,wu2020optimal}.
Figure 4: Experiments with neural networks (top row: 1 hidden layer, bottom row: 5 hidden layers). Analogous to Figure \ref{fig:experiment-norms}. See Section \ref{section:experiment} and Remark \ref{remark:NN-experiments} for details.
Figure 5: Left. Training/testing error trade-off on the "forest" dataset from the UCI regression dataset collection using kernel ridge regression with the neural tangent kernel (NTK). Each curve is labeled "DatasetName.d-f", where "d" and "f" represent the number of layers and the number of fixed layers in the NTK corresponding to ReLU networks. Right. The plot of eigenvalue against eigenvalue index for the NTK matrix exhibits a power-law spectrum. A tiny value is added to the eigenvalues for better visualization on the log scale.
...and 1 more figures
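
The caption of Figure 1 mentions estimating the norm-growth exponent by a least-squares fit in log-log space. A minimal sketch of such a fit follows. The helper name `fit_power_law_exponent` and the synthetic inputs are hypothetical, chosen only to illustrate the technique, and are not the authors' code or data.

```python
import math


def fit_power_law_exponent(ns, values):
    """Least-squares line through (log n, log value); returns (slope, intercept).

    If values ~ c * n**a, the slope estimates a and exp(intercept) estimates c.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Noiseless synthetic power-law data (illustrative values only).
ns = [100, 200, 400, 800, 1600]
values = [3.0 * n ** 1.5 for n in ns]
slope, intercept = fit_power_law_exponent(ns, values)
print(f"estimated exponent: {slope:.3f}, prefactor: {math.exp(intercept):.3f}")
```

On measured squared norms the fitted slope is only an estimate of $\alpha$, since finite-sample effects at small $n$ can bias it.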

Theorems & Definitions (48)

Definition 1.1
Definition 1.2
Definition 1.3
Theorem 1.5: Exact trade-off formula
Theorem 1.6: Rapid norm growth
Proposition 1.7: Rapid norm growth - generic
Remark 1.8
Remark 1.9: Effective-factor
Definition 2.1: Empirical spectral measure
Remark 2.3: Comparison with standard LSD
...and 38 more
