
Is Memorization Helpful or Harmful? Prior Information Sets the Threshold

Chen Cheng, Rina Foygel Barber

TL;DR

The paper studies memorization versus generalization in an overparameterized Bayesian linear model $y = X\theta + \sigma\tau$ with $\theta \sim \pi$, and shows that the optimal generalization behavior hinges on two prior-driven information measures: $\mathsf{J}_\pi$ (Fisher information) and $\mathsf{V}_\pi$ (variance). It derives sharp bounds on the Bayes estimator’s training error, $\mathsf{Train}(\widehat{\theta}_{\mathsf{B}})$, which is sandwiched between $\frac{\sigma^4}{\mathsf{V}_\pi + \sigma^2}$ and $\frac{\sigma^4}{\mathsf{J}_\pi^{-1} + \sigma^2}$, and identifies asymptotic regimes in which memorization is necessary ($\sigma^2 \lesssim \mathsf{J}_\pi^{-1}$) or overfitting is harmful ($\sigma^2 \gtrsim \mathsf{V}_\pi$). The Bayes estimator remains optimal for prediction, and the cost of using a suboptimal estimator is controlled by the discrepancy between its training error and the Bayes benchmark. The authors illustrate these regimes with concrete priors (isotropic Gaussian, approximately low-rank, and mixture-of-sparse), using random-matrix theory to characterize $\mathsf{V}_\pi$, $\mathsf{J}_\pi$, and $\lambda_\Sigma$. A Tweedie-based representation links training error to Fisher information, and monotonicity results clarify when memorization or overfitting dominates, a question that dimensionality alone does not settle. Overall, the work provides a prior-aware threshold framework for memorization in high-dimensional linear models, with implications for understanding generalization in overparameterized settings.

Abstract

We examine the connection between training error and generalization error for arbitrary estimating procedures, working in an overparameterized linear model under general priors in a Bayesian setup. We find determining factors inherent to the prior distribution $\pi$, giving explicit conditions under which optimal generalization necessitates that the training error be (i) near interpolating relative to the noise size (i.e., memorization is necessary), or (ii) close to the noise level (i.e., overfitting is harmful). Remarkably, these phenomena occur when the noise reaches thresholds determined by the Fisher information and the variance parameters of the prior $\pi$.
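As a sanity check on the two regimes, here is a minimal sketch for the simplest case one could consider: a scalar model $y = \theta + \sigma z$ with $z \sim \mathsf{N}(0,1)$ and a Gaussian prior $\theta \sim \mathsf{N}(0, v)$. This scalar setup, and the names `gaussian_train_error` and `monte_carlo_train_error`, are illustrative assumptions and not the paper's general setting. In this case the posterior mean is $\frac{v}{v+\sigma^2}\,y$, so the expected squared training residual is $\frac{\sigma^4}{v + \sigma^2}$. For a Gaussian prior the variance and the inverse Fisher information both equal $v$, so the two ends of the sandwich bound in the TL;DR coincide.

```python
import math
import random


def gaussian_train_error(v: float, sigma2: float) -> float:
    """Closed-form E[(y - theta_hat)^2] for theta ~ N(0, v), y = theta + sigma * z.

    Assumes the scalar model described in the lead-in (an illustration only).
    """
    return sigma2 ** 2 / (v + sigma2)


def monte_carlo_train_error(v: float, sigma2: float,
                            n_samples: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, using the posterior mean."""
    rng = random.Random(seed)
    shrink = v / (v + sigma2)  # posterior-mean shrinkage factor
    total = 0.0
    for _ in range(n_samples):
        theta = rng.gauss(0.0, math.sqrt(v))
        y = theta + math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        total += (y - shrink * y) ** 2  # squared training residual
    return total / n_samples


if __name__ == "__main__":
    v = 1.0
    for sigma2 in (0.01, 1.0, 100.0):
        ratio = gaussian_train_error(v, sigma2) / sigma2
        print(f"sigma^2 = {sigma2}: Train / sigma^2 = {ratio:.4f}")
```

The ratio of training error to $\sigma^2$ equals $\frac{\sigma^2}{v + \sigma^2}$. It is close to 0 when $\sigma^2 \ll v$ (near interpolation relative to the noise) and close to 1 when $\sigma^2 \gg v$ (training error at the noise level), which mirrors cases (i) and (ii) of the abstract in this toy setting.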

Paper Structure

This paper contains 52 sections, 24 theorems, 149 equations, and 3 figures.

Key Result

Proposition 1

Under the setting and notation above, for any positive definite $\Sigma\in\mathbb{R}^{d\times d}$, $\widehat{\theta}_{\mathsf{B}}$ achieves the optimal prediction error. Moreover, letting $\lambda_\Sigma = \frac{1}{n}\|X \Sigma^{-\frac{1}{2}}\|^2$, the excess prediction error of any estimator $\widehat{\theta}$ satisfies a bound (not reproduced here) that is controlled by the discrepancy between the training error of $\widehat{\theta}$ and that of $\widehat{\theta}_{\mathsf{B}}$.

Figures (3)

  • Figure 1: An illustration of the two corollaries of the main theorem: one describing the regime where memorization is necessary, and one describing the regime where overfitting is harmful. Here $C>1$ is any constant.
  • Figure 2: An illustration of the phenomenon discussed in the section on effective dimension. In the top row, we plot an isotropic prior, $\pi = \mathsf{N}(0,I_2)$, while the bottom row shows a prior that encourages approximate $1$-sparsity, $\pi = 0.5 \mathsf{N}(0,e_1e_1^\top + \eta e_2e_2^\top) + 0.5 \mathsf{N}(0,\eta e_1 e_1^\top + e_2 e_2^\top)$, for $\eta = 0.05$. However, when we zoom in to the neighborhood of a single point $(0,0.5)$, the two priors are both essentially constant.
  • Figure 3: Numerical simulations for $\pi' = 0.5 \mathsf{N}(-1, \eta) + 0.5 \mathsf{N}(1, \eta)$. From left to right, we plot $\mathsf{Train}(\sigma^2)$, $\mathsf{Train}(\sigma^2)/\sigma^2$ and $\mathsf{Train}(\sigma^2)/\sigma^4$ vs. $\sigma^2$.
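
The mixture prior of Figure 3 can be explored numerically with a short sketch. It assumes a scalar model $y = \theta + \sigma z$ with $z \sim \mathsf{N}(0,1)$, which may differ from the setup actually used for the figure; the function names and the quadrature grid are illustrative choices. Under this assumption, Tweedie's formula gives the residual $y - \widehat{\theta}_{\mathsf{B}}(y) = -\sigma^2 \frac{d}{dy}\log m(y)$, where $m$ is the marginal density of $y$, so the training error equals $\sigma^4$ times the Fisher information of $m$. For $\pi' = 0.5 \mathsf{N}(-1, \eta) + 0.5 \mathsf{N}(1, \eta)$, the marginal is the same mixture with variance $\eta + \sigma^2$, and its score is $-(y - \tanh(y/s))/s$ with $s = \eta + \sigma^2$.

```python
import math


def _mixture_expectation(f, scale2: float, n_grid: int = 20001) -> float:
    """Trapezoid-rule E[f(y)] for y ~ 0.5 N(-1, scale2) + 0.5 N(1, scale2)."""
    sd = math.sqrt(scale2)
    lo, hi = -1.0 - 10.0 * sd, 1.0 + 10.0 * sd  # tails beyond 10 sd are negligible
    h = (hi - lo) / (n_grid - 1)
    norm = 1.0 / math.sqrt(2.0 * math.pi * scale2)
    total = 0.0
    for i in range(n_grid):
        y = lo + i * h
        dens = 0.5 * norm * (math.exp(-(y + 1.0) ** 2 / (2.0 * scale2))
                             + math.exp(-(y - 1.0) ** 2 / (2.0 * scale2)))
        weight = 0.5 if i in (0, n_grid - 1) else 1.0  # trapezoid end weights
        total += weight * dens * f(y)
    return total * h


def fisher_info(scale2: float) -> float:
    """Fisher information of 0.5 N(-1, scale2) + 0.5 N(1, scale2)."""
    def score_squared(y: float) -> float:
        return ((y - math.tanh(y / scale2)) / scale2) ** 2
    return _mixture_expectation(score_squared, scale2)


def train_error(sigma2: float, eta: float) -> float:
    """Bayes training error via Tweedie: sigma^4 times Fisher info of the marginal."""
    return sigma2 ** 2 * fisher_info(eta + sigma2)


if __name__ == "__main__":
    eta = 0.05
    variance = 1.0 + eta            # variance of the prior pi'
    prior_info = fisher_info(eta)   # Fisher information of the prior pi'
    for sigma2 in (0.01, 0.1, 1.0, 10.0):
        lower = sigma2 ** 2 / (variance + sigma2)
        upper = sigma2 ** 2 / (1.0 / prior_info + sigma2)
        print(sigma2, lower, train_error(sigma2, eta), upper)
```

In this scalar setting, the printed training error should fall between the two bounds quoted in the TL;DR, with $\mathsf{V}_\pi$ taken as the prior variance $1 + \eta$ and $\mathsf{J}_\pi$ as the prior's Fisher information. Dividing the output by $\sigma^2$ or $\sigma^4$ gives the quantities named for the other two panels.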

Theorems & Definitions (25)

  • Proposition 1
  • Theorem 1
  • Theorem 2
  • Corollary 2.1
  • Corollary 2.2
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Lemma 4.1
  • ...and 15 more