SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability

Maithra Raghu, Justin Gilmer, Jason Yosinski, Jascha Sohl-Dickstein

TL;DR

SVCCA is a two-step, affine-invariant method for comparing neural representations: it first applies SVD to each layer's activations and then CCA to the resulting subspaces. It scales to convolutional layers via a discrete Fourier transform, enabling efficient cross-layer and cross-architecture comparisons. The study finds that networks converge bottom-up during training, which motivates Freeze Training as a way to reduce computation, and shows how class semantics manifest in representation sensitivity, with practical implications for compression and interpretability.

Abstract

We propose a new technique, Singular Vector Canonical Correlation Analysis (SVCCA), a tool for quickly comparing two representations in a way that is both invariant to affine transform (allowing comparison between different layers and networks) and fast to compute (allowing more comparisons to be calculated than with previous methods). We deploy this tool to measure the intrinsic dimensionality of layers, showing in some cases needless over-parameterization; to probe learning dynamics throughout training, finding that networks converge to final representations from the bottom up; to show where class-specific information in networks is formed; and to suggest new training regimes that simultaneously save computation and overfit less. Code: https://github.com/google/svcca/
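
The two steps named in the abstract (SVD, then CCA) can be sketched in a few lines of NumPy. This is a minimal illustration and not the code from the linked repository: the function name `svcca`, the `keep_variance` parameter, and the QR-based way of computing canonical correlations are assumptions made here. Activations are assumed to be arranged as a (neurons, datapoints) matrix, one row per neuron.

```python
import numpy as np

def svcca(acts1, acts2, keep_variance=0.99):
    """Sketch of SVCCA for two activation matrices of shape (neurons, datapoints).

    Returns the canonical correlations in descending order.
    `keep_variance` is a hypothetical parameter name chosen for this example.
    """
    def svd_reduce(acts):
        # Center each neuron's activations over the dataset.
        acts = acts - acts.mean(axis=1, keepdims=True)
        _, s, vt = np.linalg.svd(acts, full_matrices=False)
        # Keep the fewest singular directions explaining `keep_variance` of the variance.
        frac = np.cumsum(s ** 2) / np.sum(s ** 2)
        k = min(int(np.searchsorted(frac, keep_variance)) + 1, len(s))
        return s[:k, None] * vt[:k]          # shape (k, datapoints)

    x = svd_reduce(acts1)
    y = svd_reduce(acts2)
    # CCA step: with orthonormal bases Qx, Qy for the two (centered) subspaces,
    # the canonical correlations are the singular values of Qx^T Qy.
    qx, _ = np.linalg.qr(x.T)
    qy, _ = np.linalg.qr(y.T)
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)

# Usage: compare a representation with an invertible affine transform of itself.
rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 200))
transformed = rng.normal(size=(5, 5)) @ acts + rng.normal(size=(5, 1))
mean_similarity = svcca(acts, transformed, keep_variance=1.0).mean()
```

Averaging the returned correlations gives a single similarity score of the kind written $\bar{\rho}$ in the figure captions. Note that with `keep_variance` below 1 the SVD truncation can discard different directions in the two inputs, so the affine invariance is only exact for the CCA step.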

Paper Structure

This paper contains 29 sections, 5 theorems, 17 equations, 13 figures.

Key Result

Theorem 1

Suppose we have a translation-invariant (image) dataset $X$ and convolutional layers $l_1$, $l_2$. Letting $DFT(l_i)$ denote the discrete Fourier transform applied to each channel of $l_i$, the covariance $cov(DFT(l_1), DFT(l_2))$ is block diagonal, with blocks of size $c \times c$.
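
The statement can be checked numerically on a small example. The sketch below makes several simplifying assumptions that are not in the source: signals are 1-D rather than images, translation invariance is modeled by including every circular shift of a few base signals, the layers are circular convolutions followed by ReLU, and the covariance is taken with a complex conjugate on the second argument. All function names are hypothetical.

```python
import numpy as np

def all_circular_shifts(base):
    """Build a translation-invariant dataset: every circular shift of each base signal."""
    n = base.shape[1]
    return np.concatenate([np.roll(base, t, axis=1) for t in range(n)], axis=0)

def circular_conv_layer(x, filters):
    """x: (N, n) signals, filters: (c, k). Returns ReLU(circular conv), shape (N, c, n)."""
    n = x.shape[1]
    fx = np.fft.fft(x, axis=1)                    # (N, n)
    ff = np.fft.fft(filters, n=n, axis=1)         # filters zero-padded to length n
    out = np.fft.ifft(fx[:, None, :] * ff[None], axis=2).real
    return np.maximum(out, 0.0)

def dft_cross_covariance(l1, l2):
    """Cross-covariance of per-channel DFTs, indexed [freq1, chan1, freq2, chan2]."""
    f1 = np.fft.fft(l1, axis=2)
    f2 = np.fft.fft(l2, axis=2)
    f1 = f1 - f1.mean(axis=0)
    f2 = f2 - f2.mean(axis=0)
    return np.einsum('naf,nbg->fagb', f1, np.conj(f2)) / l1.shape[0]

rng = np.random.default_rng(0)
n, c = 8, 2                                       # spatial size, channels per layer
x = all_circular_shifts(rng.normal(size=(3, n)))
l1 = circular_conv_layer(x, rng.normal(size=(c, 3)))
l2 = circular_conv_layer(x, rng.normal(size=(c, 3)))
cov = dft_cross_covariance(l1, l2)                # shape (n, c, n, c)

# Largest entry of each (freq1, freq2) block; only freq1 == freq2 blocks should be nonzero.
block_mag = np.abs(cov).max(axis=(1, 3))
same_freq = np.eye(n, dtype=bool)
off_block_max = block_mag[~same_freq].max()
on_block_max = block_mag[same_freq].max()
```

In this toy setting `off_block_max` is zero up to floating-point error, so only the `n` same-frequency blocks of size `c` by `c` need to be compared, one frequency at a time.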

Figures (13)

  • Figure 1: To demonstrate SVCCA, we consider a toy regression task (regression target as in Figure 3). (a) We train two networks with four fully connected hidden layers starting from different random initializations, and examine the representation learned by the penultimate (shaded) layer in each network. (b) The neurons with the highest activations in net 1 (maroon) and in net 2 (green). The x-axis indexes over the dataset: in our formulation, the representation of a neuron is simply its value over a dataset (see the method section). (c) The SVD directions (i.e. the directions of maximal variance) for each network. (d) The top SVCCA directions. We see that each pair of maroon/green lines (starting from the top) is almost visually identical (up to a sign). Thus, although looking at just neurons (b) seems to indicate that the networks learn very different representations, looking at the SVCCA subspace (d) shows that the information in the representations is (up to a sign) nearly identical.
  • Figure 2: Demonstration of (a) disproportionate importance of SVCCA directions, and (b) distributed nature of some of these directions. For both panes, we first find the top $k$ SVCCA directions by training two conv nets on CIFAR-10 and comparing corresponding layers. (a) We project the output of the top three layers, pool1, fc1, fc2, onto this top-$k$ subspace. Accuracy rises rapidly with increasing $k$, with even $k \ll \mathrm{num~neurons}$ giving reasonable performance, with no retraining. Baselines of random $k$-neuron subspaces and max-activation neurons require larger $k$ to perform as well. (b) After projecting onto the top-$k$ subspace (as in (a)), the dotted lines show a second projection onto $m$ neurons, chosen to correspond highly to the top-$k$ SVCCA subspace. Many more than $k$ neurons are needed for better performance, suggesting that SVCCA directions are distributed across neurons.
  • Figure 3: The effect on the output of projecting a latent representation onto the top SVCCA directions in the toy regression task. Representations of the penultimate layer are projected onto the top $2, 6, 15, 30$ SVCCA directions (second pane onward). With $30$ directions, the output looks very similar to the full $200$ neuron output (left).
  • Figure 4: Learning dynamics plots for conv (top) and res (bottom) nets trained on CIFAR-10. Each pane is a matrix of size layers $\times$ layers, with each entry showing the SVCCA similarity $\bar{\rho}$ between the two layers. Note that learning broadly happens 'bottom up': layers closer to the input seem to solidify into their final representations first, with the exception of the very top layers. Per-layer plots are included in the Appendix. Other patterns are also visible: batch norm layers maintain nearly perfect similarity to the layer preceding them due to scaling invariance (with a slight reduction since batch norm changes the SVD directions which capture 99% of the variance). In the resnet plot, we see a stripe-like pattern due to skip connections inducing high similarities to previous layers.
  • Figure 5: Freeze Training reduces training cost and improves generalization. We apply Freeze Training to a convolutional network on CIFAR-10 and a residual network on CIFAR-10. As shown by the grey dotted lines (which indicate the timestep at which another layer is frozen), both networks have a 'linear' freezing regime: for the convolutional network, we freeze individual layers at evenly spaced timesteps throughout training. For the residual network, we freeze entire residual blocks at each freeze step. The curves were averaged over ten runs.
  • ...and 8 more figures
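
The 'linear' freezing regime described in the Figure 5 caption (one more layer frozen at each of a set of evenly spaced timesteps) can be sketched as a simple schedule. The function names below are hypothetical, and the details are assumptions made for illustration: layers are frozen starting from the one closest to the input, in line with the bottom-up convergence finding, and the last layer is trained until the end. The schedule actually used in the paper may differ.

```python
def linear_freeze_schedule(num_layers, total_steps):
    """Step at which each layer stops training; index 0 is the layer closest to the input.

    Freeze points are evenly spaced, and the last layer trains for all `total_steps`.
    """
    return [(i + 1) * total_steps // num_layers for i in range(num_layers)]

def trainable_layers(step, freeze_steps):
    """Indices of layers that still receive gradient updates at `step`."""
    return [i for i, freeze_at in enumerate(freeze_steps) if step < freeze_at]

def layer_update_fraction(freeze_steps, total_steps):
    """Fraction of per-layer updates performed, relative to training every layer throughout."""
    return sum(freeze_steps) / (len(freeze_steps) * total_steps)

# Usage: a 4-layer network trained for 100 steps.
schedule = linear_freeze_schedule(num_layers=4, total_steps=100)
still_training_at_50 = trainable_layers(50, schedule)
fraction = layer_update_fraction(schedule, total_steps=100)
```

In a training loop, the result of `trainable_layers` would decide which parameters are passed to the optimizer at each step. For residual networks the same schedule could be applied per residual block instead of per layer, as the caption describes.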

Theorems & Definitions (10)

  • Theorem 1
  • Definition 1
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • proof
  • proof
  • proof
  • proof