The Pitfalls of Memorization: When Memorization Hurts Generalization

Reza Bayat, Mohammad Pezeshki, Elvis Dohmatob, David Lopez-Paz, Pascal Vincent

TL;DR

This work addresses how memorization interacts with spurious correlations to undermine generalization under distribution shifts. It formalizes the ERM setting and shows that memorization can cause models to rely on spurious features, leading to poor held-out performance even with zero training error. To mitigate this, the authors propose Memorization-Aware Training (MAT), which shifts logits using calibrated held-out predictions (via a per-example logit adjustment) and leverages Cross-Risk Minimization (XRM) to obtain held-out signals. MAT aims to promote invariant, distribution-generalizable features and demonstrates improved worst-group performance with reduced memorization, across multiple datasets and annotation regimes. The findings highlight that memorization is not universally harmful, but can be managed and harnessed to improve robustness in real-world, distribution-shifted settings, with potential implications for scalable, group-robust learning in diverse domains.

Abstract

Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This behavior leads to poor generalization when the learned explanations rely on spurious correlations. In this work, we formalize the interplay between memorization and generalization, showing that spurious correlations lead to particularly poor generalization when they are combined with memorization. Memorization can reduce training loss to zero, leaving no incentive to learn robust, generalizable patterns. To address this, we propose memorization-aware training (MAT), which uses held-out predictions as a signal of memorization to shift a model's logits. MAT encourages learning robust patterns invariant across distributions, improving generalization under distribution shifts.
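The logit-shifting idea can be illustrated with a minimal sketch. It assumes the shift takes the common logit-adjustment form of adding the log of a held-out class probability to each logit before the softmax; the function names, the `eps` constant, and this exact form are assumptions for illustration, not the authors' implementation. The calibration step and the XRM procedure that would produce `heldout_probs` are omitted.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the class axis."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def shifted_cross_entropy(logits, heldout_probs, labels, eps=1e-12):
    """Per-example cross-entropy on logits shifted by log held-out probabilities.

    logits:        (n, k) model outputs on training examples
    heldout_probs: (n, k) class probabilities from a model that did NOT
                   train on the corresponding example (assumed given)
    labels:        (n,) integer class labels
    """
    shifted = logits + np.log(heldout_probs + eps)  # hypothetical shift form
    logp = log_softmax(shifted)
    return -logp[np.arange(len(labels)), labels]

# Two examples with identical logits and the same true class (0).
logits = np.array([[2.0, 0.0], [2.0, 0.0]])
labels = np.array([0, 0])
# The held-out model is right about the first example and wrong about the second.
heldout = np.array([[0.9, 0.1], [0.1, 0.9]])
uniform = np.full((2, 2), 0.5)

plain = shifted_cross_entropy(logits, uniform, labels)   # uniform shift = plain CE
shifted = shifted_cross_entropy(logits, heldout, labels)
```

Under this sketch, an example that the held-out model misclassifies keeps a larger loss even when the unshifted logits already fit it, so fitting it through memorization alone does not drive its loss to zero.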

Paper Structure

This paper contains 43 sections, 5 theorems, 57 equations, 7 figures, 3 tables, 1 algorithm.

Key Result

Theorem 2.2

Consider a binary classification problem under the classification-with-memorization setup, where a linear model $f(\bm{x}; \bm{w}) = \bm{x}^\top \bm{w}$ is trained using ERM on a dataset $\mathcal{D}^{\text{tr}}$. Let $\widehat{\bm{w}}_{\text{ERM}} = (\widehat{w}_y, \ldots)$ ... The condition $d \gg \log n$ ensures that example-specific features from different samples are appr...

Figures (7)

  • Figure 1: Illustration of two scenarios in the interpretable classification setup involving spurious correlations and memorization. The left panel represents a scenario without example-specific features ($\sigma_{\bm{\epsilon}} \rightarrow 0$), where memorization is not possible. In this case, the model trained with ERM initially learns the spurious feature $x_a$ serving the majority, but eventually adjusts the decision boundary to the core feature $x_y$, resulting in good generalization on minority test examples. The middle and right panels depict a scenario with example-specific features ($\sigma_{\bm{\epsilon}} > 0$), where memorization is possible. In the middle plot, the model trained with ERM fails to generalize as it memorizes the minorities using the example-specific features $\bm{\epsilon}$ leaving no more incentive for the model to learn the core feature. In contrast, the model trained with MAT successfully learns the invariant features, and generalizes well even in the presence of example-specific features.
  • Figure 2: Self-influence estimation of the Waterbirds groups by ERM and MAT. The distribution of self-influence scores is shown for both the majority and minority subpopulations (e.g., Waterbirds on water vs. Waterbirds on land). Models trained with ERM exhibit higher self-influence scores for minority subpopulations, suggesting increased memorization in these groups. In contrast, models trained with MAT show more uniform self-influence distributions across both majority and minority subpopulations. The rightmost plots display the proportion of samples in different self-influence intervals, with MAT producing a more balanced distribution compared to ERM. Further details can be found in the appendix on additional influence-score experiments.
  • Figure 3: Three types of memorization in regression models trained with different levels of example-specific features ($\sigma_{\bm{\epsilon}}$). The plots show the ERM-trained model $g(x) = g(x_y, \bm{\epsilon})$ (solid blue line) versus the true underlying function $f(x_y)$ (dashed gray line) and the noisy training examples. In all three cases, the models are trained until the training loss goes below $10^{-6}$. Good memorization (Left, $\sigma_{\bm{\epsilon}} = 10^{-4}$): The model learns the true function $f(x_y)$ well but slightly memorizes residual noise in the training data using the input example-specific features $\bm{\epsilon}$. This type of memorization is benign, as it does not compromise generalization. Bad memorization (Middle, $\sigma_{\bm{\epsilon}} = 10^{-3}$): The model relies more on example-specific features than on learning the true function $f(x_y)$, leading to partial learning of $f(x_y)$ and fitting of noise-dominated input features. This type of memorization impedes learning of generalizable patterns and is considered malign. Ugly memorization (Right, $\sigma_{\bm{\epsilon}} = 0.0$): Without example-specific features, the model overfits the training data, including label noise, resulting in a highly non-linear and complex model that fails to generalize to new data. This type is referred to as catastrophic overfitting.
  • Figure 4: Influence matrix: influence scores of all train and target data point pairs. Setting $\text{target} = \text{train}$ and $i = j$ reveals the self-influence.
  • Figure 5: Weight vector trajectories for ERM (blue), reweighting (orange), and shifting (green) compared with the max-margin solution (black dot). Trajectories are normalized as $w / \|w\| \times \log(t)$, where $t$ is the iteration step. The scaling by $\log(t)$ is for better visualization only.
  • ...and 2 more figures
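
The influence matrix described in the Figure 4 caption can be sketched for a small model where the Hessian is explicit. The example below is an illustration under stated assumptions, not the estimator used in the paper: it uses a logistic-regression model with a damped Hessian, takes the weights `w` as given, and reports the unsigned quantity $g_j^\top H^{-1} g_i$ (sign conventions for influence scores vary). The function names and the `damping` parameter are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def influence_matrix(X_train, y_train, X_target, y_target, w, damping=1e-2):
    """Return M with M[j, i] = grad(target_j)^T H^{-1} grad(train_i).

    Assumes a logistic-regression loss; H is the Hessian of the mean
    training loss plus damping * I so that it is invertible.
    """
    p_train = sigmoid(X_train @ w)
    p_target = sigmoid(X_target @ w)
    # Per-example gradients of the logistic loss with respect to w.
    G_train = (p_train - y_train)[:, None] * X_train
    G_target = (p_target - y_target)[:, None] * X_target
    n, d = X_train.shape
    curvature = (p_train * (1.0 - p_train))[:, None]
    H = (X_train * curvature).T @ X_train / n + damping * np.eye(d)
    return G_target @ np.linalg.solve(H, G_train.T)

def self_influence(X, y, w, damping=1e-2):
    """Diagonal of the influence matrix with target = train (i = j)."""
    return np.diag(influence_matrix(X, y, X, y, w, damping))

# Toy data; in practice w would come from a trained model.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = (rng.random(20) > 0.5).astype(float)
w = rng.normal(size=5)

M = influence_matrix(X, y, X, y, w)
scores = self_influence(X, y, w)
```

With target = train the matrix is symmetric and its diagonal is non-negative, so the diagonal entries can be compared across examples or groups as a rough memorization score.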

Theorems & Definitions (9)

  • Theorem 2.2
  • Lemma F.1
  • proof
  • Lemma G.1
  • proof
  • Lemma G.2
  • proof
  • Lemma G.3
  • proof