Fantastic Generalization Measures and Where to Find Them
Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, Samy Bengio
TL;DR
This paper asks why deep networks generalize, through a large-scale, controlled study of more than 40 complexity measures across 2187 CIFAR-10 models (with additional SVHN results) generated by varying seven hyperparameters. It introduces more robust evaluation tools: Kendall's rank correlation, a granulated Kendall coefficient, and a conditional independence test that probes for causal relations between measures and generalization. The key finding is that sharpness-based measures, particularly PAC-Bayes bounds and magnitude-aware perturbation metrics (notably 1/alpha'), reliably predict generalization across diverse hyperparameters, while many norm- and spectral-based measures fail or even correlate negatively with generalization, which may be due to optimization randomness. The results show that optimization dynamics are informative, favor causal-style analyses over simple correlations, and offer guidance for developing more reliable generalization measures for deep learning systems.
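The Kendall-based evaluation mentioned above can be illustrated with a minimal sketch. It assumes each trained model is summarized by a hyperparameter dictionary, one complexity-measure value, and its generalization gap. The function names (`kendall_tau`, `granulated_kendall`), the data layout, and the toy numbers are hypothetical and are not taken from the paper's code. The granulated variant shown here averages the rank correlation over groups of models that differ only in one chosen hyperparameter, which is one plausible reading of the idea rather than the authors' exact definition.

```python
from collections import defaultdict
from itertools import combinations


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def kendall_tau(measure, gap):
    """Kendall rank correlation (tau-a) between two equal-length sequences.

    A value of 1 means the complexity measure ranks every pair of models
    in the same order as their generalization gaps; -1 means the reverse.
    """
    n = len(measure)
    if n != len(gap) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    total = sum(
        sign(measure[i] - measure[j]) * sign(gap[i] - gap[j])
        for i, j in combinations(range(n), 2)
    )
    return total / (n * (n - 1) / 2)


def granulated_kendall(models, axis):
    """Average Kendall tau over groups where only `axis` varies.

    `models` is a list of (hyperparams, measure_value, gap) tuples
    (an assumed layout). Models sharing all hyperparameters except
    `axis` are placed in one group; groups with fewer than two models
    are skipped.
    """
    groups = defaultdict(list)
    for hyperparams, mu, gap in models:
        key = tuple(sorted((k, v) for k, v in hyperparams.items() if k != axis))
        groups[key].append((mu, gap))
    taus = [
        kendall_tau([m for m, _ in grp], [g for _, g in grp])
        for grp in groups.values()
        if len(grp) >= 2
    ]
    if not taus:
        raise ValueError("no group has at least two models")
    return sum(taus) / len(taus)


# Toy data (made up for illustration only).
toy_models = [
    ({"lr": 0.1, "width": 1}, 1.0, 0.10),
    ({"lr": 0.01, "width": 1}, 2.0, 0.20),
    ({"lr": 0.1, "width": 2}, 3.0, 0.05),
    ({"lr": 0.01, "width": 2}, 4.0, 0.02),
]
tau_lr = granulated_kendall(toy_models, "lr")
tau_width = granulated_kendall(toy_models, "width")
```

In the toy data the measure tracks the gap in one learning-rate group and reverses in the other, so the per-axis score for `lr` averages out, while the score for `width` is consistently negative. Scoring each hyperparameter axis separately in this way keeps a measure from looking good overall merely because it tracks a single hyperparameter.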
Abstract
Generalization of deep networks has been of great interest in recent years, resulting in a number of theoretically and empirically motivated complexity measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether the conclusions drawn from those experiments would remain valid in other settings. We present the first large-scale study of generalization in deep networks. We investigate more than 40 complexity measures taken from both theoretical bounds and empirical studies. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters. Hoping to uncover potentially causal relationships between each measure and generalization, we analyze carefully controlled experiments and show surprising failures of some measures as well as promising measures for further research.
