Fantastic Generalization Measures and Where to Find Them

Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, Samy Bengio

TL;DR

This paper investigates why deep networks generalize through a large-scale, controlled study of 40 complexity measures across 2187 CIFAR-10 models (with additional SVHN results) generated by varying seven hyperparameters. It introduces robust evaluation tools: Kendall's rank correlation, a granulated Kendall coefficient, and a conditional independence test that probes potentially causal relations between each measure and generalization. The key finding is that sharpness-based measures, particularly PAC-Bayes bounds and magnitude-aware perturbation metrics (notably $1/\alpha'$), reliably predict generalization across diverse hyperparameters, while many norm- and spectral-based measures can fail or even correlate negatively with generalization due to optimization randomness. The results indicate that optimization dynamics are informative, favor causal-style analyses over simple correlations, and offer guidance for developing more reliable generalization measures for deep learning systems.
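To make the Kendall-based ranking concrete, the following is a minimal sketch of Kendall's rank correlation between a complexity measure and the generalization gap over a set of trained models. It uses the standard pairwise-sign (tau-a) form without tie correction, which is an assumption about the exact variant; the function name `kendall_tau` and the numeric values are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def kendall_tau(measure, gap):
    """Kendall's tau-a between two equal-length sequences.

    Each unordered pair of models contributes +1 if the measure and the
    generalization gap rank the pair the same way, -1 if they disagree,
    and 0 if either value is tied. The sum is divided by the pair count.
    """
    if len(measure) != len(gap):
        raise ValueError("sequences must have the same length")
    pairs = list(combinations(range(len(measure)), 2))
    if not pairs:
        raise ValueError("need at least two models")
    total = sum(
        sign(measure[i] - measure[j]) * sign(gap[i] - gap[j])
        for i, j in pairs
    )
    return total / len(pairs)


# Hypothetical values for four models (illustrative only).
sharpness = [0.10, 0.25, 0.40, 0.55]
gen_gap = [0.02, 0.05, 0.04, 0.09]

# Five of the six pairs are ranked consistently and one is not,
# so tau = (5 - 1) / 6.
tau = kendall_tau(sharpness, gen_gap)
print(round(tau, 4))
```

A value near 1 means the measure orders models the same way the generalization gap does, while a value near -1 means it orders them in reverse. A granulated variant would, under the same assumptions, apply this function only to groups of models that differ in a single hyperparameter and then average the results.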

Abstract

Generalization of deep networks has been of great interest in recent years, resulting in a number of theoretically and empirically motivated complexity measures. However, most papers proposing such measures study only a small set of models, leaving open the question of whether the conclusions drawn from those experiments would remain valid in other settings. We present the first large-scale study of generalization in deep networks. We investigate more than 40 complexity measures taken from both theoretical bounds and empirical studies. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters. Hoping to uncover potentially causal relationships between each measure and generalization, we analyze carefully controlled experiments and show surprising failures of some measures as well as promising measures for further research.

Paper Structure

This paper contains 38 sections, 4 theorems, 49 equations, 4 figures, 10 tables, 3 algorithms.

Key Result

Theorem 1

Let $\mathcal{F}$ be the class of feed-forward networks with a fixed computation graph of depth $d$ and ReLU activations. Let $a_i$ and $q_i$ be the number of activations and parameters in layer $i$. Then the VC-dimension of $\mathcal{F}$ can be bounded as follows:

Figures (4)

  • Figure 1: Left: Graph at initialization of IC algorithm. Middle: The ideal graph where the measure $\mu$ can directly explain observed generalization. Right: Graph for correlation where $\mu$ cannot explain observed generalization.
  • Figure 2: Left: Number of models with training accuracy above 0.99 for each hyperparameter type. Middle: Distribution of training cross-entropy; distribution of training error can be found in Figure 4. Right: Distribution of generalization gap.
  • Figure 3: Joint Probability table for a single ${\mathcal{S}}_{ab}$
  • Figure 4: Distribution of training error on the trained models.

Theorems & Definitions (4)

  • Theorem 1: bartlett19
  • Theorem 2
  • Theorem 3: pitas2017pac
  • Theorem 4