Is Memorization Helpful or Harmful? Prior Information Sets the Threshold
Chen Cheng, Rina Foygel Barber
TL;DR
The paper studies memorization versus generalization in an overparameterized Bayesian linear model $y = X\theta + \sigma\tau$ with $\theta \sim \pi$, and shows that the training error required for optimal generalization is governed by two information measures of the prior: $\mathsf{J}_\pi$ (Fisher information) and $\mathsf{V}_\pi$ (variance). It derives sharp bounds on the training error of the Bayes estimator, $\mathsf{Train}(\widehat{\theta}_{\mathsf{B}})$, which lies between $\frac{\sigma^4}{\mathsf{V}_\pi + \sigma^2}$ and $\frac{\sigma^4}{\mathsf{J}_\pi^{-1} + \sigma^2}$, and identifies asymptotic regimes in which memorization is necessary ($\sigma^2 \lesssim \mathsf{J}_\pi^{-1}$) or overfitting is harmful ($\sigma^2 \gtrsim \mathsf{V}_\pi$). The Bayes estimator remains optimal for prediction, and the cost of using a suboptimal estimator is controlled by the gap between its training error and the Bayes benchmark. The authors illustrate these regimes with three concrete priors (isotropic Gaussian, approximately low-rank, and mixture-of-sparse), using random-matrix theory to characterize $\mathsf{V}_\pi$, $\mathsf{J}_\pi$, and $\lambda_\Sigma$. A Tweedie-based representation links training error to Fisher information, and monotonicity results clarify when memorization or overfitting dominates, showing that the regime is not determined by dimensionality alone. Overall, the work provides a prior-aware threshold framework for memorization in high-dimensional linear models, with implications for understanding generalization in overparameterized settings.
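As a rough sanity check on the sandwich bounds quoted above, the following minimal sketch simulates a toy scalar case. It rests on assumptions that are not taken from the paper: an identity design (so $y = \theta + \sigma z$ coordinate-wise), a Gaussian prior $\theta \sim N(0, v)$, training error measured as the mean squared residual $(y - \widehat{\theta})^2$, and $\mathsf{V}_\pi$, $\mathsf{J}_\pi$ read as the ordinary variance and Fisher information of the prior density. The paper's own definitions in the overparameterized setting may differ. Under these assumptions $\mathsf{V}_\pi = v$ and $\mathsf{J}_\pi = 1/v$, so the two bounds coincide at $\sigma^4 / (v + \sigma^2)$. The function names `simulate_train_error` and `sandwich` are hypothetical.

```python
import numpy as np


def simulate_train_error(v, sigma2, n=200_000, seed=0):
    """Monte Carlo training error of the posterior mean in a toy scalar model.

    Assumed toy model (not the paper's setup): theta ~ N(0, v),
    y = theta + N(0, sigma2) noise, identity design.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, np.sqrt(v), size=n)
    y = theta + rng.normal(0.0, np.sqrt(sigma2), size=n)
    # Posterior mean under a Gaussian prior is linear shrinkage of y.
    theta_hat = v / (v + sigma2) * y
    # Training error taken here as the mean squared residual.
    return float(np.mean((y - theta_hat) ** 2))


def sandwich(v_pi, j_pi, sigma2):
    """Lower and upper bounds sigma^4/(V + sigma^2) and sigma^4/(1/J + sigma^2)."""
    lower = sigma2**2 / (v_pi + sigma2)
    upper = sigma2**2 / (1.0 / j_pi + sigma2)
    return lower, upper


if __name__ == "__main__":
    v, sigma2 = 1.0, 0.25
    # Scalar Gaussian prior: variance v, Fisher information 1/v.
    lower, upper = sandwich(v_pi=v, j_pi=1.0 / v, sigma2=sigma2)
    print("bounds:", lower, upper)
    print("simulated training error:", simulate_train_error(v, sigma2))
```

In this toy case the simulated value should land close to the common bound. When $\sigma^2$ is small relative to $v$, the training error is much smaller than the noise level $\sigma^2$ (near interpolation); when $\sigma^2$ is large relative to $v$, it approaches $\sigma^2$.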
Abstract
We examine the connection between training error and generalization error for arbitrary estimating procedures, working in an overparameterized linear model under general priors in a Bayesian setup. We find determining factors inherent to the prior distribution $\pi$, giving explicit conditions under which optimal generalization necessitates that the training error be (i) near interpolating relative to the noise size (i.e., memorization is necessary), or (ii) close to the noise level (i.e., overfitting is harmful). Remarkably, these phenomena occur when the noise reaches thresholds determined by the Fisher information and the variance parameters of the prior $\pi$.
