On the importance of single directions for generalization

Ari S. Morcos, David G. T. Barrett, Neil C. Rabinowitz, Matthew Botvinick

TL;DR

The paper investigates why some neural networks generalize better than others by measuring, through ablations and noise perturbations, how much each network depends on single directions in activation space. It links greater reliance on single directions to memorization and worse generalization. Batch normalization implicitly reduces this reliance, whereas dropout limits it only up to roughly the dropout fraction used in training. The paper also challenges the idea that highly selective single units are especially important: class selectivity poorly predicts a unit's impact on network output, and networks that generalize well depend less on individual units. Overall, the work suggests single-direction reliance as a signal for assessing generalization (for example, in hyperparameter selection and early stopping) and argues for interpretability approaches that look beyond single-unit selectivity.

Abstract

Despite their ability to memorize large datasets, deep neural networks often achieve good generalization performance. However, the differences between the learned solutions of networks which generalize and those which do not remain unclear. Additionally, the tuning properties of single directions (defined as the activation of a single unit or some linear combination of units in response to some input) have been highlighted, but their importance has not been evaluated. Here, we connect these lines of inquiry to demonstrate that a network's reliance on single directions is a good predictor of its generalization performance, across networks trained on datasets with different fractions of corrupted labels, across ensembles of networks trained on datasets with unmodified labels, across different hyperparameters, and over the course of training. While dropout only regularizes this quantity up to a point, batch normalization implicitly discourages single direction reliance, in part by decreasing the class selectivity of individual units. Finally, we find that class selectivity is a poor predictor of task importance, suggesting not only that networks which generalize well minimize their dependence on individual units by reducing their selectivity, but also that individually selective units may not be necessary for strong network performance.
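
As a rough illustration of what "class selectivity of individual units" can mean in practice, the sketch below computes a selectivity index for one unit. The formula is an assumption, since this section does not state it: the highest class-conditional mean activation minus the mean of the remaining class-conditional means, divided by their sum. The function name `class_selectivity` and the `eps` stabilizer are hypothetical, and activations are assumed to be non-negative (for example, post-ReLU).

```python
import numpy as np

def class_selectivity(acts, labels, n_classes, eps=1e-12):
    """Selectivity index for a single unit (assumed definition, see lead-in).

    acts:   (n_samples,) non-negative activations of one unit
    labels: (n_samples,) integer class labels in [0, n_classes)
    Returns roughly 0 for a unit that responds equally to all classes
    and roughly 1 for a unit that responds to only one class.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels)
    # Mean activation of the unit for each class.
    class_means = np.array([acts[labels == c].mean() for c in range(n_classes)])
    top = int(np.argmax(class_means))
    mu_max = class_means[top]
    # Assumption: "other classes" are summarized by the mean of their class means.
    mu_rest = np.delete(class_means, top).mean()
    return float((mu_max - mu_rest) / (mu_max + mu_rest + eps))

labels = np.array([0, 0, 1, 1, 2, 2])
selective_unit = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # fires only for class 0
uniform_unit = np.full(6, 0.5)                              # same response to every class

sel_selective = class_selectivity(selective_unit, labels, n_classes=3)
sel_uniform = class_selectivity(uniform_unit, labels, n_classes=3)
print(sel_selective, sel_uniform)
```

Under this definition, a high index says only that a unit's response is concentrated on one class; it says nothing by itself about how much the network's output changes when that unit is ablated, which is the distinction the abstract draws.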

Paper Structure

This paper contains 18 sections, 1 equation, 11 figures.

Figures (11)

  • Figure 1: Memorizing networks are more sensitive to cumulative ablations. Networks were trained on MNIST (2-hidden layer MLP, a), CIFAR-10 (11-layer convolutional network, b), and ImageNet (50-layer ResNet, c). In a, all units in all layers were ablated, while in b and c, only feature maps in the last three layers were ablated. Error bars represent standard deviation across 10 random orderings of units to ablate.
  • Figure 2: Memorizing networks are more sensitive to random noise. Networks were trained on MNIST (2-hidden layer MLP, a), and CIFAR-10 (11-layer convolutional network, b). Noise was scaled by the empirical variance of each unit on the training set. Error bars represent standard deviation across 10 runs. X-axis is on a log scale.
  • Figure 3: Networks which generalize poorly are more reliant on single directions. 200 networks with identical topology were trained on unmodified CIFAR-10. a, Cumulative ablation curves for the best and worst 5 networks by generalization error. Error bars represent standard deviation across 5 models and 10 random orderings of feature maps per model. b, Area under cumulative ablation curve (normalized) as a function of generalization error.
  • Figure 4: Single direction reliance as a signal for hyperparameter selection and early stopping. a, Train (blue) and test (purple) loss, along with the normalized area under the cumulative ablation curve (AUC; green) over the course of training for an MNIST MLP. Loss y-axis has been cropped to make train/test divergence visible. b, AUC and test loss for a CIFAR-10 ConvNet are negatively correlated over the course of training. c, AUC and test accuracy are positively correlated across a hyperparameter sweep (96 hyperparameters with 2 repeats for each).
  • Figure 5: Impact of regularizers on networks' reliance upon single directions. a, Cumulative ablation curves for MLPs trained on unmodified and fully corrupted MNIST with dropout fractions $\in \{0.1, 0.2, 0.3\}$. Colored dashed lines indicate number of units ablated equivalent to the dropout fraction used in training. Note that curves for networks trained on corrupted MNIST begin to drop soon past the dropout fraction with which they were trained. b, Cumulative ablation curves for networks trained on CIFAR-10 with and without batch normalization. Error bars represent standard deviation across 4 model instances and 10 random orderings of feature maps per model.
  • ...and 6 more figures