Distributional Generalization: A New Kind of Generalization
Preetum Nakkiran, Yamini Bansal
TL;DR
This work introduces Distributional Generalization, a fine-grained notion of generalization that compares train and test outputs as distributions rather than comparing only average error. It formalizes this through a framework of indistinguishability with respect to a broad family of tests, and proposes two concrete conjectures: Feature Calibration and Agreement. The authors prove a key case for 1-NN and provide extensive empirical evidence across neural networks, kernel methods, and decision trees, showing that interpolating classifiers often preserve structured aspects of the data distribution in their test outputs. They also explore extensions to non-interpolating methods and discuss limitations, ensembling, and open questions. Overall, the paper shows that interpolating models reproduce distributional properties of the data that conventional generalization metrics do not capture.
Abstract
We introduce a new notion of generalization -- Distributional Generalization -- which roughly states that outputs of a classifier at train and test time are close *as distributions*, as opposed to close in just their average error. For example, if we mislabel 30% of dogs as cats in the train set of CIFAR-10, then a ResNet trained to interpolation will in fact mislabel roughly 30% of dogs as cats on the *test set* as well, while leaving other classes unaffected. This behavior is not captured by classical generalization, which would only consider the average error and not the distribution of errors over the input domain. Our formal conjectures, which are much more general than this example, characterize the form of distributional generalization that can be expected in terms of problem parameters: model architecture, training procedure, number of samples, and data distribution. We give empirical evidence for these conjectures across a variety of domains in machine learning, including neural networks, kernel machines, and decision trees. Our results thus advance our empirical understanding of interpolating classifiers.
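The label-noise example above can be reproduced in miniature. The following is a toy sketch only: it swaps the ResNet on CIFAR-10 for a 1-nearest-neighbor classifier on synthetic one-dimensional data, which is an assumption made here so that the example runs in pure Python. The class positions, the function names (`sample`, `fit_1nn`, `predict_1nn`, `run`), and the sample sizes are all hypothetical choices for illustration, not taken from the paper.

```python
import bisect
import random

def sample(n, rng):
    """Draw n (x, true_label) pairs from three well-separated 1-D classes (toy assumption)."""
    offsets = {"dog": 0.0, "cat": 2.0, "ship": 4.0}
    names = list(offsets)
    data = []
    for _ in range(n):
        label = rng.choice(names)
        # Each class occupies its own unit interval: [0,1), [2,3), [4,5).
        data.append((offsets[label] + rng.random(), label))
    return data

def fit_1nn(train):
    """Store the training points sorted by x so nearest-neighbor lookup can use bisect."""
    train = sorted(train)
    return [x for x, _ in train], [y for _, y in train]

def predict_1nn(model, x):
    """Return the (possibly noisy) label of the training point closest to x."""
    xs, ys = model
    i = bisect.bisect_left(xs, x)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(xs)]
    best = min(candidates, key=lambda j: abs(xs[j] - x))
    return ys[best]

def run(n_train=6000, n_test=6000, flip_prob=0.3, seed=0):
    rng = random.Random(seed)
    # Mislabel a fraction flip_prob of training dogs as cats; other classes are untouched.
    train = [
        (x, "cat" if y == "dog" and rng.random() < flip_prob else y)
        for x, y in sample(n_train, rng)
    ]
    model = fit_1nn(train)  # 1-NN interpolates: it fits the noisy training labels exactly
    test = sample(n_test, rng)  # test labels are clean

    def rate(true_label, predicted_label):
        preds = [predict_1nn(model, x) for x, y in test if y == true_label]
        return sum(p == predicted_label for p in preds) / len(preds)

    return {
        "test_dog_as_cat": rate("dog", "cat"),
        "test_cat_as_cat": rate("cat", "cat"),
        "test_ship_as_ship": rate("ship", "ship"),
    }

if __name__ == "__main__":
    for name, value in run().items():
        print(f"{name}: {value:.3f}")
```

In this toy setting, the fraction of test dogs predicted as cats should land near the training noise rate `flip_prob`, while cats and ships are classified as before. Average test error alone would report only a single number and would not reveal that the errors are concentrated on one class in the same proportion as in the train set.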
