DNNs, Dataset Statistics, and Correlation Functions
Robert W. Batterman, James F. Woodward
TL;DR
This work argues that the remarkable generalization of deep neural networks (DNNs) in image tasks arises not solely from model capacity but from exploiting structured, higher-order correlations present in real-world datasets. It introduces a correlation-function methodology to connect mesoscale statistical structures in data to macroscopic, continuum-like behavior, drawing an analogy to representative volume elements in materials science. Empirical evidence from random matrix theory studies of datasets and weight matrices, together with demonstrations of higher-order correlations in MNIST, supports the claim that DNNs learn and use N-point statistics beyond two-point covariances. The authors propose a research program focused on identifying and leveraging features of world structure that enable robust generalization, challenging purely worst-case explanations based on statistical learning theory (SLT). The practical impact is a shift toward data-centric explanations of generalization and toward methods that uncover and exploit dataset statistics at multiple scales.
Abstract
This paper argues that dataset structure is important in image recognition tasks (among other tasks). We focus on the nature and genesis of correlational structure in the actual datasets on which DNNs are trained. We argue that DNNs implement a methodology, widespread in condensed matter physics and materials science, that focuses on mesoscale correlation structures that live between fundamental atomic/molecular scales and continuum scales. In particular, we argue that DNNs that are successful in image classification must be discovering higher-order correlation functions. It is well known that DNNs successfully generalize in apparent contravention of standard statistical learning theory. We consider the implications of our discussion for this puzzle.
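To make the notion of a higher-order correlation function concrete, the following is a minimal illustrative sketch and is not taken from the paper. It builds a toy dataset of three binary "pixels" in which the third pixel is the product of the first two, and compares it with a control in which all three pixels are independent. The function names `two_point` and `three_point` and the toy data are assumptions introduced here for illustration only. The two datasets have identical means and identical two-point functions, so they can be told apart only by the three-point function.

```python
import itertools
import numpy as np

def two_point(X):
    """Empirical two-point function C_ij = <x_i x_j> of mean-centered data."""
    Xc = X - X.mean(axis=0)
    return np.einsum("ni,nj->ij", Xc, Xc) / len(Xc)

def three_point(X):
    """Empirical three-point function C_ijk = <x_i x_j x_k> of mean-centered data."""
    Xc = X - X.mean(axis=0)
    return np.einsum("ni,nj,nk->ijk", Xc, Xc, Xc) / len(Xc)

# Hypothetical toy data: pixels x1, x2 take values +/-1, and x3 = x1 * x2.
signs = np.array(list(itertools.product([-1.0, 1.0], repeat=2)))
structured = np.column_stack([signs, signs[:, 0] * signs[:, 1]])

# Control: all 8 sign patterns equally likely, i.e. three independent pixels.
independent = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))

C2_structured = two_point(structured)
C2_independent = two_point(independent)
C3_structured = three_point(structured)
C3_independent = three_point(independent)

# Both two-point functions are the identity matrix, so covariance alone
# cannot distinguish the datasets.
print(np.allclose(C2_structured, C2_independent))

# The three-point function differs: <x1 x2 x3> is nonzero only when
# x3 is tied to x1 and x2.
print(C3_structured[0, 1, 2], C3_independent[0, 1, 2])
```

In this toy case any method that sees only means and covariances treats the two datasets as identical, whereas a method sensitive to three-point statistics does not. Real image datasets are of course far richer than this example.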
