The $\varphi$ Curve: The Shape of Generalization through the Lens of Norm-based Capacity Control

Yichen Wang, Yudong Chen, Lorenzo Rosasco, Fanghui Liu

TL;DR

The paper reframes generalization through norm-based capacity rather than model size, using random features models to obtain precise learning-curve characterizations via deterministic equivalents. It shows that the test risk can be expressed through a norm-based quantity, with a phase transition between the under- and over-parameterized regimes, and that double descent does not appear under this capacity measure. A linear relation between risk and norm emerges in the over-parameterized regime, while power-law settings yield explicit scaling laws, reinforcing that norm control (e.g., via regularization) recovers the classical U-shaped generalization curve. These results offer a principled lens for understanding generalization in large, over-parameterized systems and provide new deterministic tools for broader applications, including scaling analyses and potential OOD studies.

Abstract

Understanding how the test risk scales with model complexity is a central question in machine learning. Classical theory is challenged by the learning curves observed for large over-parameterized deep networks. Capacity measures based on parameter count typically fail to account for these empirical observations. To tackle this challenge, we consider norm-based capacity measures and develop our study for estimators based on random features, widely used as simplified theoretical models for more complex networks. In this context, we provide a precise characterization of how the estimator's norm concentrates and how it governs the associated test error. Our results show that the predicted learning curve admits a phase transition from under- to over-parameterization, but no double descent behavior. This confirms that the more classical U-shaped behavior is recovered when considering appropriate capacity measures based on model norms rather than size. From a technical point of view, we leverage deterministic equivalence as the key tool and further develop new deterministic quantities which are of independent interest.
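To make the two capacity axes concrete, here is a minimal simulation sketch that fits random features ridge regression for several widths $p$ and records both the squared norm of the estimator and its test risk. Everything beyond the erf feature map is an assumption made for illustration: the synthetic linear teacher, the $1/\sqrt{p}$ feature scaling, the $n\lambda$ ridge normalization, the weight variance $1/d$, and the hypothetical function name `rf_ridge_sweep`. It does not reproduce the paper's experiments or its deterministic equivalents.

```python
import math
import numpy as np

# Elementwise erf using only the standard library and NumPy.
_erf = np.vectorize(math.erf, otypes=[float])


def rf_ridge_sweep(n=100, d=20, p_values=(25, 50, 100, 200, 400),
                   lam=1e-4, noise_std=0.2, n_test=500, seed=0):
    """Return a list of (p, squared norm of a_hat, test risk) tuples."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(d) / np.sqrt(d)      # assumed linear teacher
    X = rng.standard_normal((n, d))
    X_test = rng.standard_normal((n_test, d))
    y = X @ theta + noise_std * rng.standard_normal(n)
    y_test = X_test @ theta                          # noiseless test targets

    results = []
    for p in p_values:
        # Random first-layer weights; the 1/sqrt(d) scaling is an assumption.
        W = rng.standard_normal((d, p)) / np.sqrt(d)
        Z = _erf(X @ W) / np.sqrt(p)
        Z_test = _erf(X_test @ W) / np.sqrt(p)
        # Ridge solution a_hat = (Z^T Z + n*lam*I)^{-1} Z^T y.
        a_hat = np.linalg.solve(Z.T @ Z + n * lam * np.eye(p), Z.T @ y)
        norm_sq = float(a_hat @ a_hat)
        risk = float(np.mean((Z_test @ a_hat - y_test) ** 2))
        results.append((p, norm_sq, risk))
    return results


if __name__ == "__main__":
    for p, norm_sq, risk in rf_ridge_sweep():
        print(f"p={p:4d}  ||a_hat||^2={norm_sq:10.3f}  test risk={risk:8.4f}")
```

Plotting the recorded risk once against $p$ and once against the squared norm gives the two views of the learning curve that the abstract contrasts.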

Paper Structure

This paper contains 75 sections, 28 theorems, 340 equations, 25 figures, 5 tables.

Key Result

Theorem 3.1

Given the RFMs in Section \ref{sec:preli}, the bias-variance decomposition of the norm ${\mathbb E}_{\varepsilon}\|\hat{{\bm a}}\|_2^2$ is given by ${\mathbb E}_{\varepsilon}\|\hat{{\bm a}}\|_2^2 =: \mathcal{N}_{\lambda}^{\tt RFM} = \mathcal{B}_{\mathcal{N},\lambda}^{\tt RFM} + \mathcal{V}_{\mathcal{N},\lambda}^{\tt RFM}$. Under Assumption \ref{ass:concentrated_RFRR}, we have the following asymptotic deterministic equivalents: …
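The decomposition of the expected norm into a bias and a variance part can be checked numerically for a fixed feature matrix. The sketch below assumes a ridge estimator of the form $\hat{{\bm a}} = ({\bm Z}^{\!\top}{\bm Z} + n\lambda {\bm I})^{-1}{\bm Z}^{\!\top}{\bm y}$ with ${\bm y} = {\bm f} + {\bm \varepsilon}$ and i.i.d. Gaussian noise of variance $\sigma^2$; this form and normalization are assumptions, since the theorem statement above is truncated. Under them, writing $\hat{{\bm a}} = {\bm A}{\bm y}$, the expectation over the noise splits into $\|{\bm A}{\bm f}\|_2^2$ (bias part) plus $\sigma^2 \|{\bm A}\|_F^2$ (variance part). The function names are hypothetical, and the code verifies only this finite-sample identity, not the deterministic equivalents.

```python
import numpy as np


def norm_bias_variance(Z, f, sigma2, lam):
    """Closed-form bias and variance parts of E_eps ||a_hat||^2 for fixed Z."""
    n, p = Z.shape
    # A maps labels to coefficients: a_hat = A @ y.
    A = np.linalg.solve(Z.T @ Z + n * lam * np.eye(p), Z.T)
    bias = float(np.sum((A @ f) ** 2))          # ||A f||^2
    variance = float(sigma2 * np.sum(A ** 2))   # sigma^2 * ||A||_F^2
    return bias, variance, A


def monte_carlo_norm(A, f, sigma2, n_draws=20000, seed=0):
    """Monte Carlo estimate of E_eps ||A (f + eps)||^2."""
    rng = np.random.default_rng(seed)
    eps = np.sqrt(sigma2) * rng.standard_normal((n_draws, f.shape[0]))
    a_hat = (f + eps) @ A.T                     # one estimator per noise draw
    return float(np.mean(np.sum(a_hat ** 2, axis=1)))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Z = rng.standard_normal((50, 30)) / np.sqrt(30)
    f = rng.standard_normal(50)
    bias, variance, A = norm_bias_variance(Z, f, sigma2=0.04, lam=1e-2)
    print("closed form:", bias + variance)
    print("monte carlo:", monte_carlo_norm(A, f, sigma2=0.04))
```

The cross term vanishes because the noise has zero mean, which is why only the two squared terms remain.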

Figures (25)

  • Figure 1: \ref{fig:lecture_figure} presents previous empirical observations from \cite{ngcs229} in the random feature model. \ref{fig:RFM_result} shows the learning curve precisely characterized by our theory, which matches our experiments (shown as points) with training data $\{({\bm x}_i, y_i)\}_{i=1}^n$, $n = 300$, sub-sampled from MNIST \cite{lecun1998gradient} with $d=748$. The feature map is defined as $\varphi({\bm x}, {\bm w}) = {\rm erf}(\langle {\bm x}, {\bm w}\rangle)$ with random initialization ${\bm w} \sim \mathcal{N}(0, {\bm I})$. Note that \ref{fig:lecture_figure} and \ref{fig:RFM_result} differ in whether the curve eventually falls below its earlier level, mainly because of the data; see \ref{app:discussion_3} for more discussion.
  • Figure 2: Bias and variance curves for RFMs, plotted against model size $p$ in \ref{fig:bias_variance_risk} and against the norm ${\mathbb E}_{\varepsilon}\|\hat{{\bm a}}\|_2^2$ in \ref{fig:bias_variance_risk_norm}. \ref{fig:norm_vs_lambda_varying_lambda} establishes a one-to-one correspondence between the norm and $\lambda$ for a fixed $p$ across varying $\lambda$ values. \ref{fig:risk_vs_norm_varying_lambda} examines the relationship between risk and norm under the same conditions. Training data $\{({\bm x}_i, y_i)\}_{i \in [n]}$, $n = 100$, sampled from the model $y_i = {\bm g}_i^{\!\top} {\bm \theta}_* + \varepsilon_i$, $\sigma^2 = 0.04$, ${\bm g}_i \sim \mathcal{N}(0, {\bm I})$, ${\bm f}_i \sim \mathcal{N}(0, {\bm \Lambda})$, with $\xi^2_k({\bm \Lambda})=k^{-3/2}$ and ${\bm \theta}_{*,k}=k^{-1}$.
  • Figure 3: \ref{fig:rff_risk_vs_norm_approx_1} and \ref{fig:rff_risk_vs_norm_approx_2}: Validation of \ref{prop:relation_minnorm_powerlaw_rf}. The solid line represents the result of the deterministic equivalents, which is well approximated by the red dashed line of \ref{eq:RORFM} in the over-parameterized regime and by the blue dashed line of \ref{eq:RORFM} when $p \to n$ in the under-parameterized regime. \ref{fig:scaling_law_norm_based}: The values of the exponents $\gamma_n$ and $\gamma_{{\mathsf N}}$ in different regions (divided by $q$ and $\ell$) for $r \in (0, \frac{1}{2})$. Variance-dominated regions are colored orange, yellow, and brown; bias-dominated regions are colored blue and green.
  • Figure 4: The relationship between the test risk $\mathsf{R}$, the norm $\mathsf{N}$, their bias and variance ($\mathsf{B}_{\mathsf{R}}$, $\mathsf{V}_{\mathsf{R}}$, $\mathsf{B}_{\mathsf{N}}$, $\mathsf{V}_{\mathsf{N}}$), and the ratio $\gamma := \frac{d}{n}$ for the linear regression model. Training data $\{({\bm x}_i, y_i)\}_{i \in [n]}$, $d = 1000$, sampled from a linear model $y_i = {\bm x}_i^{\!\top} {\bm \beta}_* + \varepsilon_i$, $\sigma^2 = 0.0004$, ${\bm x}_i \sim \mathcal{N}(0, {\bm \Sigma})$, with $\sigma_k({\bm \Sigma})=k^{-1}$, ${\bm \beta}_{*,k}=k^{-3/2}$. The ridge parameter is $\lambda = 0.005$. Note that in the under-parameterized regime ($d < n$), the bias of the test risk is zero.
  • Figure 5: Relationship between ${\mathsf R}^{\tt LS}_\lambda$ and ${\mathsf N}^{\tt LS}_\lambda$ under the linear model $y_i = {\bm x}_i^{\!\top} {\bm \beta}_* + \varepsilon_i$, with $d=500$, ${\bm \Sigma} = {\bm I}_d$, $\|{\bm \beta}_*\|_2^2=10$, and $\sigma^2 = 1$. The dashed line corresponds to the ridgeless regression curve.
  • ...and 20 more figures
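The risk-versus-norm relationship for ridge regression described in the Figure 5 caption can be explored empirically with a short sketch. It uses the caption's settings ($d=500$, ${\bm \Sigma} = {\bm I}_d$, $\|{\bm \beta}_*\|_2^2=10$, $\sigma^2 = 1$); the sample size $n$, the grid of $\lambda$ values, the $n\lambda$ ridge normalization, and the hypothetical function name `ridge_path` are assumptions. It traces one finite-sample path, not the deterministic curve shown in the figure.

```python
import numpy as np


def ridge_path(n=250, d=500, beta_norm_sq=10.0, sigma2=1.0,
               lams=(1e-3, 1e-2, 1e-1, 1.0, 10.0), seed=0):
    """Return (lam, squared norm, excess risk) along a ridge path, Sigma = I."""
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(d)
    beta *= np.sqrt(beta_norm_sq) / np.linalg.norm(beta)   # ||beta*||^2 = 10
    X = rng.standard_normal((n, d))
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)

    # One thin SVD serves every lam:
    # beta_hat = V diag(s / (s^2 + n*lam)) U^T y.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Uty = U.T @ y

    path = []
    for lam in lams:
        beta_hat = Vt.T @ (s / (s ** 2 + n * lam) * Uty)
        norm_sq = float(beta_hat @ beta_hat)
        # With identity covariance the excess risk is ||beta_hat - beta*||^2.
        excess_risk = float(np.sum((beta_hat - beta) ** 2))
        path.append((lam, norm_sq, excess_risk))
    return path


if __name__ == "__main__":
    for lam, norm_sq, risk in ridge_path():
        print(f"lam={lam:8.3f}  norm^2={norm_sq:8.3f}  excess risk={risk:8.3f}")
```

Because the squared norm shrinks monotonically as $\lambda$ grows, each $\lambda$ corresponds to one norm value, so the path can be replotted with the norm on the horizontal axis.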

Theorems & Definitions (49)

  • Theorem 3.1: Deterministic equivalence of $\mathcal{N}_{\lambda}^{\tt RFM}$
  • Corollary 3.2: Asymptotic deterministic equivalence of ${\mathsf N}_{0}^{\tt RFM}$
  • Proposition 4.1: Linear learning curve
  • Corollary 4.2: Relationship for min-$\ell_2$ norm interpolator under power law
  • Proposition 4.3
  • Definition B.1: Effective regularization
  • Definition B.2: Degrees of freedom
  • Proposition B.3
  • Proposition B.4
  • Proposition B.5
  • ...and 39 more
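Definitions B.1 and B.2 are listed here by name only, so the sketch below relies on the forms commonly used in the deterministic-equivalents literature, which may differ from the paper's own statements. It assumes the effective regularization $\lambda_*$ is the positive solution of $n - \lambda/\lambda_* = {\rm Tr}({\bm \Sigma}({\bm \Sigma} + \lambda_*{\bm I})^{-1})$ and the degrees of freedom are ${\rm df}_1(\lambda) = {\rm Tr}({\bm \Sigma}({\bm \Sigma}+\lambda{\bm I})^{-1})$ and ${\rm df}_2(\lambda) = {\rm Tr}({\bm \Sigma}^2({\bm \Sigma}+\lambda{\bm I})^{-2})$. The function names are hypothetical.

```python
import numpy as np


def df1(eigs, lam):
    """Assumed first degrees of freedom: Tr(Sigma (Sigma + lam I)^{-1})."""
    eigs = np.asarray(eigs, dtype=float)
    return float(np.sum(eigs / (eigs + lam)))


def df2(eigs, lam):
    """Assumed second degrees of freedom: Tr(Sigma^2 (Sigma + lam I)^{-2})."""
    eigs = np.asarray(eigs, dtype=float)
    return float(np.sum((eigs / (eigs + lam)) ** 2))


def effective_regularization(eigs, n, lam, tol=1e-12, max_iter=200):
    """Solve n - lam/t = df1(t) for t > 0 by bisection (requires lam > 0)."""
    if lam <= 0:
        raise ValueError("this sketch assumes lam > 0")

    def gap(t):
        # Strictly increasing in t, so the root is unique.
        return n - lam / t - df1(eigs, t)

    lo = lam / n                  # gap(lo) = -df1(lo) < 0
    hi = max(1.0, 2.0 * lo)
    while gap(hi) <= 0:           # gap(t) -> n > 0 as t grows
        hi *= 2.0
    for _ in range(max_iter):
        mid = np.sqrt(lo * hi)    # geometric midpoint copes with wide ranges
        if gap(mid) > 0:
            hi = mid
        else:
            lo = mid
        if hi - lo <= tol * hi:
            break
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    eigs = np.arange(1, 201, dtype=float) ** -1.5   # power-law spectrum
    lam_star = effective_regularization(eigs, n=100, lam=1e-3)
    print(lam_star, df1(eigs, lam_star), df2(eigs, lam_star))
```

As a sanity check under these assumed definitions, a single eigenvalue of 1 with $n=2$ and $\lambda=1.5$ gives $\lambda_* = 1$, since $2 - 1.5 = 1/(1+1)$.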