Characterizing Structural Regularities of Labeled Data in Overparameterized Models
Ziheng Jiang, Chiyuan Zhang, Kunal Talwar, Michael C. Mozer
TL;DR
This paper defines a Consistency Profile and a scalar C-score to quantify per-instance generalization as training data grows, revealing a continuum between densely regular modes and sparse, ambiguous regions in data distributions. It provides an empirical estimation framework, analyzes proxies, and applies the method to MNIST, CIFAR-10/100, and ImageNet, uncovering meaningful structure such as mislabeled or outlier instances at one end and well-supported regular examples at the other. The authors compare distance-based and learning-speed proxies, finding that learning-speed metrics correlate best with the C-score and offer scalable diagnostics. They demonstrate practical uses in data pruning, outlier detection, and studying optimizer dynamics, and release code and precomputed scores to enable broader adoption.
Abstract
Humans are accustomed to environments that contain both regularities and exceptions. For example, at most gas stations, one pays prior to pumping, but the occasional rural station does not accept payment in advance. Likewise, deep neural networks can generalize across instances that share common patterns or structures, yet have the capacity to memorize rare or irregular forms. We analyze how individual instances are treated by a model via a consistency score. The score characterizes the expected accuracy for a held-out instance given training sets of varying size sampled from the data distribution. We obtain empirical estimates of this score for individual instances in multiple data sets, and we show that the score identifies out-of-distribution and mislabeled examples at one end of the continuum and strongly regular examples at the other end. We identify computationally inexpensive proxies to the consistency score using statistics collected during training. We show examples of potential applications to the analysis of deep-learning systems.
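To make the definition of the score concrete, the following is a minimal sketch of a holdout-style estimate, not the authors' implementation. For each of several training-set sizes, it repeatedly samples a random training subset, fits a model, and records whether each held-out instance is classified correctly. Averaging over runs gives a per-instance accuracy at each size, and averaging over sizes collapses that profile into a single scalar. Everything named here is an assumption made for illustration: the function names, the `subset_ratios` and `n_runs` parameters, the choice to average uniformly over ratios, and above all the nearest-centroid classifier, which stands in for the neural networks the paper studies.

```python
import numpy as np


def nearest_centroid_predict(x_train, y_train, x_test, n_classes):
    """Hypothetical stand-in model: predict the class whose training mean is closest."""
    # Classes absent from the training subset keep an infinite centroid,
    # so they are never selected by argmin.
    centroids = np.full((n_classes, x_train.shape[1]), np.inf)
    for c in range(n_classes):
        mask = y_train == c
        if mask.any():
            centroids[c] = x_train[mask].mean(axis=0)
    dists = ((x_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def estimate_consistency_scores(x, y, n_classes,
                                subset_ratios=(0.25, 0.5, 0.75),
                                n_runs=50, seed=0):
    """Estimate per-instance held-out accuracy at several training-set sizes.

    Returns (profile, score): profile[i, j] is the fraction of runs at
    subset_ratios[i] in which instance j was held out and classified
    correctly; score[j] averages profile[:, j] over the ratios
    (an assumed, uniform weighting).
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    profile = np.full((len(subset_ratios), n), np.nan)
    for i, ratio in enumerate(subset_ratios):
        correct = np.zeros(n)
        held_out = np.zeros(n)
        n_train = max(1, int(round(ratio * n)))
        for _ in range(n_runs):
            perm = rng.permutation(n)
            train_idx, test_idx = perm[:n_train], perm[n_train:]
            pred = nearest_centroid_predict(
                x[train_idx], y[train_idx], x[test_idx], n_classes)
            correct[test_idx] += pred == y[test_idx]
            held_out[test_idx] += 1
        seen = held_out > 0  # instances held out at least once at this ratio
        profile[i, seen] = correct[seen] / held_out[seen]
    return profile, np.nanmean(profile, axis=0)


if __name__ == "__main__":
    # Toy data: two well-separated clusters, with one deliberately flipped label.
    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal([-3.0, 0.0], 0.5, size=(40, 2)),
                        rng.normal([3.0, 0.0], 0.5, size=(40, 2))])
    y = np.concatenate([np.zeros(40, dtype=int), np.ones(40, dtype=int)])
    y[0] = 1  # mislabeled instance
    profile, score = estimate_consistency_scores(x, y, n_classes=2)
    print("score of mislabeled instance:", score[0])
    print("median score of the rest:", np.median(score[1:]))
```

On this toy data the flipped-label instance should land at the low end of the score range and the regular cluster members at the high end, mirroring the continuum the abstract describes. Swapping in a real network only requires replacing `nearest_centroid_predict`, at the cost of one training run per sampled subset.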
