Sensitivity and Generalization in Neural Networks: an Empirical Study

Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein

TL;DR

The study investigates why large neural networks generalize well by introducing sensitivity-based complexity metrics that quantify local input perturbation effects. It defines the input-output Jacobian Frobenius norm and a trajectory-based transition count to measure sensitivity around data, and conducts a large-scale empirical analysis across thousands of fully-connected networks on multiple image datasets. Key findings show strong correlations between reduced Jacobian-based sensitivity near the data manifold and better generalization, with regularization techniques and mini-batch SGD promoting robustness; per-point Jacobian values can also predict misclassification tendencies. The results offer a geometry-driven perspective on generalization and suggest robustness as a practical criterion for model selection, while outlining future work to extend to more complex architectures and tasks.

Abstract

In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations. Our experiments survey thousands of models with various fully-connected architectures, optimizers, and other hyper-parameters, as well as four different image classification datasets. We find that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that this robustness correlates well with generalization. We further establish that factors associated with poor generalization (such as full-batch training or using random labels) correspond to lower robustness, while factors associated with good generalization (such as data augmentation and ReLU non-linearities) give rise to more robust functions. Finally, we demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points.
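
As a rough illustration of the first metric, the sketch below computes the Frobenius norm of the input-output Jacobian at individual points and averages it over a batch. It is a minimal JAX sketch under stated assumptions, not the authors' code: the Jacobian is assumed to be taken of the softmax class probabilities with respect to the flattened input, and the network sizes, random weights, and the helper names `init_params`, `predict`, `jacobian_norm`, and `mean_jacobian_norm` are hypothetical.

```python
import jax
import jax.numpy as jnp


def init_params(key, sizes):
    """Random weights for a small fully-connected network (illustrative only)."""
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (fan_in, fan_out)) / jnp.sqrt(fan_in)
        params.append((w, jnp.zeros(fan_out)))
    return params


def predict(params, x):
    """Map one flattened input to class probabilities (ReLU hidden layers)."""
    h = x
    for w, b in params[:-1]:
        h = jax.nn.relu(h @ w + b)
    w, b = params[-1]
    # Assumption: sensitivity is measured on the softmax output.
    return jax.nn.softmax(h @ w + b)


def jacobian_norm(params, x):
    """Frobenius norm of d(probabilities)/d(input) at a single point x."""
    jac = jax.jacobian(predict, argnums=1)(params, x)  # shape (classes, inputs)
    return jnp.linalg.norm(jac)  # default norm of a 2-D array is Frobenius


def mean_jacobian_norm(params, xs):
    """Average the per-point norm over a batch of inputs."""
    return jnp.mean(jax.vmap(jacobian_norm, in_axes=(None, 0))(params, xs))


# Toy usage: 8 input features, 16 hidden units, 4 classes, 5 random "test" points.
params = init_params(jax.random.PRNGKey(0), [8, 16, 4])
xs = jax.random.normal(jax.random.PRNGKey(1), (5, 8))
per_point = jax.vmap(jacobian_norm, in_axes=(None, 0))(params, xs)
average = mean_jacobian_norm(params, xs)
```

The per-point values (`per_point`) correspond to the point-level use mentioned at the end of the abstract, while `average` is the kind of aggregate one would compare against a generalization gap. Here the inputs are random placeholders rather than real test data.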

Paper Structure

This paper contains 31 sections, 14 equations, 10 figures.

Figures (10)

  • Figure 1: 2160 networks trained to 100% training accuracy on CIFAR10 (see the appendix section "Sensitivity and Generalization" for experimental details). Left: while increasing capacity of the model allows for overfitting (top), very few models do, and a model with the maximum parameter count yields the best generalization (bottom right). Right: train loss does not correlate well with generalization, and the best model (minimum along the $y$-axis) has training loss many orders of magnitude higher than models that generalize worse (left). This observation rules out underfitting as the reason for poor generalization in low-capacity models. See Neyshabur et al. (2014) for similar findings in the case of achievable 0 training loss.
  • Figure 2: A 100%-accurate (on training data) MNIST network implements a function that is much more stable near training data than away from it. Left: depiction of a hypothetical circular trajectory in input space passing through three digits of different classes, highlighting the training point locations ($\pi/3$, $\pi$, $5\pi/3$). Center: Jacobian norm as the input traverses an elliptical trajectory. Sensitivity drops significantly in the vicinity of training data while remaining uniform along random ellipses. Right: transition density behaves analogously. According to both metrics, as the input moves between points of different classes, the function becomes less stable than when it moves between points of the same class. This is consistent with the intuition that linear combinations of different digits lie further from the data manifold than those of same-class digits (which need not hold for more complex datasets). See the appendix section "Sensitivity along a Trajectory" for experimental details.
  • Figure 3:
  • Figure 4: Improvement in generalization (left column) due to using correct labels, data augmentation, ReLUs, mini-batch optimization (top to bottom) is consistently coupled with reduced sensitivity as measured by the Jacobian norm (center column). Transitions (right column) correlate with generalization in all considered scenarios except for comparing optimizers (bottom right). Each point on the plot corresponds to two neural networks that share all hyper-parameters and the same optimization procedure, but differ in a certain property as indicated by axes titles. The coordinates along each axis reflect the values of the quantity in the title of the plot in the respective setting (i.e. with true or random labels). All networks have reached 100% training accuracy on CIFAR10 in both settings (except for the data-augmentation study, second row; see the appendix section "Sensitivity and Generalization Factors" for details). See the appendix section "Sensitivity and Generalization" for experimental details (the appendix section "Sensitivity and Generalization Factors" for the data-augmentation study) and the section "How to Read Plots" for plot interpretation.
  • Figure 5: Jacobian norm correlates with generalization gap on all considered datasets. Each point corresponds to a network trained to 100% training accuracy (or at least 99.9% in the case of CIFAR100). See the appendix sections "Sensitivity and Generalization Factors" and "Sensitivity and Generalization" for experimental details of the bottom and top plots, respectively.
  • ...and 5 more figures
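
The trajectory experiment described in the Figure 2 caption can be sketched as follows. This is a hedged JAX illustration, not the paper's implementation. It assumes one possible ellipse construction (centered at the mean of three inputs and passing through them at angles $\pi/3$, $\pi$, $5\pi/3$), and it assumes that a "transition" is a change in the on/off state of a hidden ReLU unit between consecutive samples along the closed curve. The helper names `ellipse_through`, `activation_pattern`, and `count_transitions`, the network sizes, and the random weights are all hypothetical.

```python
import jax
import jax.numpy as jnp


def init_params(key, sizes):
    """Random weights for a small fully-connected ReLU network (illustrative)."""
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        w = jax.random.normal(sub, (fan_in, fan_out)) / jnp.sqrt(fan_in)
        params.append((w, jnp.zeros(fan_out)))
    return params


def ellipse_through(x1, x2, x3, num_points=256):
    """Sample a closed ellipse hitting x1, x2, x3 at angles pi/3, pi, 5*pi/3."""
    center = (x1 + x2 + x3) / 3.0
    u = center - x2                                     # z(pi) = center - u = x2
    v = (x1 - center - 0.5 * u) * 2.0 / jnp.sqrt(3.0)   # z(pi/3) = x1
    theta = jnp.linspace(0.0, 2.0 * jnp.pi, num_points, endpoint=False)
    return center + jnp.cos(theta)[:, None] * u + jnp.sin(theta)[:, None] * v


def activation_pattern(params, x):
    """Concatenated on/off states of every hidden ReLU unit for one input."""
    h, bits = x, []
    for w, b in params[:-1]:
        pre = h @ w + b
        bits.append(pre > 0)
        h = jnp.maximum(pre, 0.0)
    return jnp.concatenate(bits)


def count_transitions(params, points):
    """Total number of unit flips between consecutive points on a closed curve."""
    patterns = jax.vmap(activation_pattern, in_axes=(None, 0))(params, points)
    following = jnp.roll(patterns, -1, axis=0)  # wrap around: the curve is closed
    return int(jnp.sum(patterns != following))


# Toy usage with random stand-ins for three training inputs.
params = init_params(jax.random.PRNGKey(0), [8, 16, 16, 4])
x1, x2, x3 = jax.random.normal(jax.random.PRNGKey(1), (3, 8))
points = ellipse_through(x1, x2, x3)
transitions = count_transitions(params, points)
```

Counting flips only between sampled points gives a lower bound on the number of linear-region boundaries actually crossed, so a finer sampling (larger `num_points`) gives a tighter estimate.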