Distributional Generalization: A New Kind of Generalization

Preetum Nakkiran, Yamini Bansal

TL;DR

This work introduces Distributional Generalization, a fine-grained notion of generalization that compares train and test outputs as distributions rather than merely comparing average error. It formalizes this through a framework of indistinguishability and a broad family of tests, and proposes two concrete conjectures: Feature Calibration and Agreement. The authors prove a key case for 1-NN and provide extensive empirical evidence across neural networks, kernel methods, and decision trees, showing that interpolating classifiers often preserve structured aspects of the data distribution in their test outputs. They also explore extensions to non-interpolating methods and discuss limitations, ensembling, and open questions. Overall, the paper sheds new light on how interpolating models internalize and reproduce distributional properties of the data beyond conventional generalization metrics.

Abstract

We introduce a new notion of generalization -- Distributional Generalization -- which roughly states that outputs of a classifier at train and test time are close *as distributions*, as opposed to close in just their average error. For example, if we mislabel 30% of dogs as cats in the train set of CIFAR-10, then a ResNet trained to interpolation will in fact mislabel roughly 30% of dogs as cats on the *test set* as well, while leaving other classes unaffected. This behavior is not captured by classical generalization, which would only consider the average error and not the distribution of errors over the input domain. Our formal conjectures, which are much more general than this example, characterize the form of distributional generalization that can be expected in terms of problem parameters: model architecture, training procedure, number of samples, and data distribution. We give empirical evidence for these conjectures across a variety of domains in machine learning, including neural networks, kernel machines, and decision trees. Our results thus advance our empirical understanding of interpolating classifiers.
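The label-noise example in the abstract can be imitated at toy scale. The sketch below is not the paper's experiment. It assumes synthetic 2-D Gaussian clusters in place of CIFAR-10 and a 1-nearest-neighbor classifier in place of a ResNet (1-NN is the case the TL;DR says the authors prove). All names (`make_data`, `one_nn_predict`, the cluster centers, the sample sizes) are hypothetical choices made for illustration.

```python
import numpy as np


def make_data(n, rng):
    """Sample n points from three well-separated 2-D Gaussian clusters.

    Returns the points and their true class indices (0, 1, 2).
    The centers are an arbitrary illustrative choice.
    """
    centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
    classes = rng.integers(0, 3, size=n)
    x = centers[classes] + rng.normal(size=(n, 2))
    return x, classes


def one_nn_predict(x_train, y_train, x_test):
    """Give each test point the label of its nearest train point."""
    # Squared Euclidean distances, shape (n_test, n_train).
    d2 = ((x_test[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=-1)
    return y_train[d2.argmin(axis=1)]


rng = np.random.default_rng(0)
p = 0.3  # probability of relabeling class 0 ("dog") as class 1 ("cat")

x_train, c_train = make_data(2000, rng)
x_test, c_test = make_data(1500, rng)

# Inject label noise into the train set only: class 0 -> class 1 w.p. p.
y_train = c_train.copy()
flip = (c_train == 0) & (rng.random(c_train.shape[0]) < p)
y_train[flip] = 1

pred_train = one_nn_predict(x_train, y_train, x_train)
pred_test = one_nn_predict(x_train, y_train, x_test)

# 1-NN interpolates: every train point is its own nearest neighbor.
train_error = np.mean(pred_train != y_train)
# Fraction of true class-0 test points that the classifier outputs as class 1.
test_flip_rate = np.mean(pred_test[c_test == 0] == 1)
# Error on the classes that received no label noise.
other_error = np.mean(pred_test[c_test != 0] != c_test[c_test != 0])

print(f"train error vs noisy labels: {train_error:.3f}")
print(f"test rate of class 0 -> 1:   {test_flip_rate:.3f} (train noise p = {p})")
print(f"test error on other classes: {other_error:.3f}")
```

Under these assumptions, the test-time rate of class 0 predicted as class 1 should land near `p`, while the untouched classes stay nearly error-free. An average-error metric alone would report only a single number and would hide this structure.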

Paper Structure

This paper contains 56 sections, 2 theorems, 37 equations, 24 figures, 3 tables.

Key Result

Theorem 1

Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$, and let $n \in \mathbb{N}$ be the number of train samples. Assume the following regularity condition holds: sampling the nearest-neighbor train point to a random test point yields (close to) a uniformly random test point. Tha... Then, Conjecture 1 (Feature Calibration) holds. For all $(\varepsilon, \mathrm{NN}, \mathcal{D}, n)$-distinguis...

Figures (24)

  • Figure 1: The setup and result of Experiment \ref{exp:intro1}. The CIFAR-10 train set is labeled as either Animals or Objects, with label noise affecting only cats. A WideResNet-28-10 is then trained to 0 train error on this train set, and evaluated on the test set. The joint distribution of $(x, f(x))$ on the train set is close to $(x, f(x))$ on the test set. Full experimental details are in Appendix \ref{app:intro-exp-1}.
  • Figure 2: Toy Example: Feature Calibration. Schematic of the distributions discussed in Section \ref{sec:toy}, showing a toy example of the Feature Calibration conjecture for several distinguishable features $L$.
  • Figure 3: Feature Calibration for Constant Partition $L$: The CIFAR-10 train and test sets are class rebalanced according to (A). Interpolating classifiers are trained on the train set, and we plot the class balance of their outputs on the test set. This roughly matches the class balance of the train set, even for poorly-generalizing classifiers.
  • Figure 4: Feature Calibration with original classes on CIFAR-10: We train a WRN-28-10 on the CIFAR-10 dataset where we mislabel class $0 \rightarrow 1$ with probability $p$. (A): Joint density of the distinguishable features $L$ (the original CIFAR-10 class) and the classification task labels $y$ on the train set for noise probability $p=0.4$. (B): Joint density of the original CIFAR-10 classes $L$ and the network outputs $f(x)$ on the test set. (C): Observed noise probability in the network outputs on the test set (the (1, 0) entry of the matrix in B) for varying noise probabilities $p$.
  • Figure 5: Feature Calibration with random confusion matrix on CIFAR-10: Left: Joint density of labels $y$ and original class $L$ on the train set. Right: Joint density of classifier predictions $f(x)$ and original class $L$ on the test set, for a WideResNet-28-10 trained to interpolation. These two joint densities are close, as predicted by Conjecture \ref{conj:approx}.
  • ...and 19 more figures

Theorems & Definitions (9)

  • Definition 1: $(\varepsilon, \mathcal{F}, \mathcal{D}, n)$-Distinguishable Feature
  • Conjecture 1: Feature Calibration
  • Theorem 1
  • Conjecture 2: Agreement Property
  • Conjecture 3: Generalized Feature Calibration, informal
  • Conjecture 4: Conditional Density Estimation, Informal
  • Proof: Proof of Theorem \ref{thm:Ltest}
  • Theorem 2: Agreement Property
  • Proof