Does Learning Require Memorization? A Short Tale about a Long Tail

Vitaly Feldman

TL;DR

The paper asks why modern overparameterized learners memorize training labels, including those of mislabeled and rare examples, even though such memorization appears unnecessary for generalization. It introduces a simple formal model in which data is drawn from a long-tailed mixture of subpopulations governed by a frequency prior, and shows that fitting the labels of singleton subpopulations (those represented by a single training example) is necessary for near-optimal generalization: failing to fit them incurs excess error, quantified by a bound that depends on the tail mass. The author extends the discrete unstructured setting to general mixture models, illustrates the mechanism with local and linear classifiers, and connects memorization to stability and privacy, showing that differential privacy and model compression carry nontrivial costs, especially for rare subgroups. Together, the results give a principled account of long-tail memorization, with implications for privacy, generalization theory, and distribution-aware training on real-world high-dimensional data.

Abstract

State-of-the-art results on image recognition tasks are achieved using over-parameterized learning algorithms that (nearly) perfectly fit the training set and are known to fit even random labels well. This tendency to memorize the labels of the training data is not explained by existing theoretical analyses. Memorization of the training data also presents significant privacy risks when the training data contains sensitive personal information, and thus it is important to understand whether such memorization is necessary for accurate learning. We provide the first conceptual explanation and a theoretical model for this phenomenon. Specifically, we demonstrate that for natural data distributions memorization of labels is necessary for achieving close-to-optimal generalization error. Crucially, even labels of outliers and noisy labels need to be memorized. The model is motivated and supported by the results of several recent empirical works. In our model, data is sampled from a mixture of subpopulations and our results show that memorization is necessary whenever the distribution of subpopulation frequencies is long-tailed. Image and text data are known to be long-tailed and therefore our results establish a formal link between these empirical phenomena. Our results make it possible to quantify the cost of limiting memorization in learning and explain the disparate effects that privacy and model compression have on different subgroups.
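The long-tail mechanism described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's construction: it assumes, purely for illustration, that subpopulation frequencies follow a Zipf-like law, and the function name `singleton_mass` and all parameter values are hypothetical. It draws a training sample from the mixture and reports how many subpopulations appear exactly once, along with the total probability mass those subpopulations carry at test time.

```python
import random
from collections import Counter


def singleton_mass(num_subpops=5000, n=5000, exponent=1.0, seed=0):
    """Sample n points from a Zipf-like mixture of subpopulations.

    Returns (fraction of training points that are singletons,
             total test-time probability mass of singleton subpopulations).
    The Zipf-like frequencies are an illustrative assumption, not the
    frequency prior used in the paper.
    """
    rng = random.Random(seed)

    # Assumed long-tailed frequencies: subpopulation k has weight 1 / k^exponent.
    weights = [1.0 / (k ** exponent) for k in range(1, num_subpops + 1)]
    total = sum(weights)
    freqs = [w / total for w in weights]

    # Each training point is identified only by its subpopulation index.
    sample = rng.choices(range(num_subpops), weights=freqs, k=n)
    counts = Counter(sample)

    # Singletons: subpopulations observed exactly once in the training set.
    singletons = [s for s, c in counts.items() if c == 1]

    frac_singleton_points = len(singletons) / n
    test_mass = sum(freqs[s] for s in singletons)
    return frac_singleton_points, test_mass


if __name__ == "__main__":
    frac, mass = singleton_mass()
    print(f"fraction of training points that are singletons: {frac:.3f}")
    print(f"test probability mass of singleton subpopulations: {mass:.3f}")
```

Under these assumptions, the second number is the share of test points whose subpopulation the learner saw only once. A learner that refuses to fit the single available label for those subpopulations has no other information about them, which is the intuition behind the cost of not memorizing.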

Paper Structure

This paper contains 29 sections, 16 theorems, 86 equations, 2 figures.

Key Result

Lemma 2.1

For any frequency prior $\pi$, $x \in X$ and a sequence of points $V = (x_1,\ldots,x_n)\in X^n$ that includes $x$ exactly $\ell$ times, we have

Figures (2)

  • Figure 1: Long tail of class frequencies and subpopulation frequencies within classes. The figure is taken from zhu2014capturing with the authors' permission.
  • Figure 2: Hardest examples for a differentially private model to predict accurately (among those accurately predicted by a non-private model) on the left vs. the easiest ones on the right. The top row is for the digit "3" from the MNIST dataset and the bottom row is for the class "plane" from the CIFAR-10 dataset. The figure is extracted from carlini2018prototypical with the authors' permission. Details of the training process can be found in the original work.

Theorems & Definitions (32)

  • Lemma 2.1
  • Definition 2.2
  • Theorem 2.3
  • proof
  • Theorem 2.4
  • proof
  • Lemma 2.5
  • Lemma 2.6
  • proof
  • Lemma 2.7
  • ...and 22 more