Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth

Thao Nguyen, Maithra Raghu, Simon Kornblith

TL;DR

The paper empirically investigates how neural network width and depth shape learned representations and outputs. It introduces minibatch CKA to compare hidden representations across ResNet variants trained on CIFAR-10/100 and ImageNet, revealing a block-structured pattern that emerges with overparameterization relative to data. This block structure corresponds to preserving a dominant first principal component across layers and is largely unique to each model, though non-block regions show shared features across architectures. Despite similar overall accuracy, wide and deep networks exhibit systematic per-example and per-class differences in predictions, suggesting complementary strengths for different task aspects and guiding considerations for architecture design and pruning.

Abstract

A key factor in the success of deep neural networks is the ability to scale models to improve performance by varying the architecture depth and width. This simple property of neural network design has resulted in highly effective architectures for a variety of tasks. Nevertheless, there is limited understanding of the effects of depth and width on the learned representations. In this paper, we study this fundamental question. We begin by investigating how varying depth and width affects model hidden representations, finding a characteristic block structure in the hidden representations of larger-capacity (wider or deeper) models. We demonstrate that this block structure arises when model capacity is large relative to the size of the training set, and is indicative of the underlying layers preserving and propagating the dominant principal component of their representations. This discovery has important ramifications for the features learned by different models: representations outside the block structure are often similar across architectures with varying widths and depths, but the block structure is unique to each model. We also analyze the output predictions of different model architectures, finding that even when the overall accuracy is similar, wide and deep models exhibit distinctive error patterns and variations across classes.

Paper Structure

This paper contains 23 sections, 1 theorem, 9 equations, 21 figures, 3 tables.

Key Result

Proposition 1

Let $\bm{K} \in \mathbb{R}^{m \times m}$ and $\bm{L} \in \mathbb{R}^{m \times m}$ be two kernel matrices constructed by applying kernel functions $k$ and $l$ respectively to all pairs of examples in a dataset $\mathcal{D}$. Form $c$ random partitionings $p$ of $\mathcal{D}$ into $m/n$ minibatches $b$ ...
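
The statement of Proposition 1 is cut off here, so the following is only a minimal sketch of how a minibatch CKA could be computed, not the paper's exact procedure. It assumes a linear kernel and the unbiased HSIC estimator of Song et al. (2012), averaged over the minibatches of one random partitioning. The function names `unbiased_hsic` and `minibatch_linear_cka`, the `batch_size` and `seed` parameters, and the choice to drop leftover examples are all illustrative assumptions.

```python
import numpy as np


def unbiased_hsic(K, L):
    """Unbiased HSIC estimate (Song et al., 2012) for two n x n kernel matrices."""
    n = K.shape[0]
    if n < 4:
        raise ValueError("the unbiased estimator needs at least 4 examples")
    # Work on copies with zeroed diagonals, as the estimator requires.
    K = K.copy()
    L = L.copy()
    np.fill_diagonal(K, 0.0)
    np.fill_diagonal(L, 0.0)
    term1 = np.sum(K * L.T)                                  # tr(K L)
    term2 = K.sum() * L.sum() / ((n - 1) * (n - 2))          # (1'K1)(1'L1) / ((n-1)(n-2))
    term3 = 2.0 / (n - 2) * (K.sum(axis=0) @ L.sum(axis=1))  # 2/(n-2) * 1'KL1
    return (term1 + term2 - term3) / (n * (n - 3))


def minibatch_linear_cka(X, Y, batch_size, seed=0):
    """Linear CKA between X (m x p1) and Y (m x p2), accumulated over minibatches.

    Rows of X and Y must refer to the same m examples. Examples left over
    after the last full minibatch are dropped (an assumption of this sketch).
    """
    m = X.shape[0]
    perm = np.random.default_rng(seed).permutation(m)
    hsic_xy = hsic_xx = hsic_yy = 0.0
    for start in range(0, m - batch_size + 1, batch_size):
        idx = perm[start:start + batch_size]
        K = X[idx] @ X[idx].T  # linear-kernel Gram matrix for layer 1
        L = Y[idx] @ Y[idx].T  # linear-kernel Gram matrix for layer 2
        hsic_xy += unbiased_hsic(K, L)
        hsic_xx += unbiased_hsic(K, K)
        hsic_yy += unbiased_hsic(L, L)
    # The 1/k averaging factors cancel between numerator and denominator.
    return hsic_xy / np.sqrt(hsic_xx * hsic_yy)
```

Because the HSIC terms are summed across minibatches before the ratio is taken, activations only need to be held in memory one minibatch at a time, which is what makes this kind of estimate practical for large datasets and wide layers.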

Figures (21)

  • Figure 1: Emergence of the block structure with increasing width or depth. As we increase the depth or width of neural networks, we see the emergence of a large, contiguous set of layers with very similar representations: the block structure. Each pane shows the CKA similarity between all pairs of layers in a single neural network as a heatmap, with x and y axes indexing layers. See Appendix Figure \ref{fig:wide_deep_no_residual} for block structure in wide networks without residual connections.
  • Figure 2: Block structure emerges in narrower networks when trained on less data. We plot CKA similarity heatmaps as we increase network width (going right along each row) and also decrease the dataset size (down each column). As a result of the increased model capacity (with respect to the task) from smaller dataset size, smaller (narrower) models now also exhibit the block structure.
  • Figure 3: Block structure arises from preserving and propagating the (dominant) first principal component of the layer representations. The figure contains two sets of four plots, for layers of a deep network (left) and a wide network (right). CKA of the representations (top right) shows block structure in both networks. Comparing this to the variance explained by the top principal component of each layer representation (bottom left) shows that layers in the block structure have a highly dominant first principal component. This principal component is also preserved throughout the block structure, as seen by comparing the squared cosine similarity of the first principal component across pairs of layers (top left) to the CKA representation similarity (top right). After the first principal component is removed from the representations (bottom right), the block structure is greatly reduced relative to the original CKA (top right), indicating that the block structure arises from propagating the first principal component.
  • Figure 4: Linear probe accuracy. Top: CKA between layers of individual ResNet models, for different architectures and initializations. Bottom: Accuracy of linear probes for each of the layers before (orange) and after (blue) the residual connections.
  • Figure 5: Effect of deleting blocks on accuracy for models with and without block structure. Blue lines show the effect of deleting blocks backwards one-by-one within each ResNet stage. (Note the plateau at the block structure.) Vertical green lines reflect boundaries between ResNet stages. Horizontal gray line reflects accuracy of the full model.
  • ...and 16 more figures
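
The three diagnostics described for Figure 3 can be sketched on toy data as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: layer activations are taken to be `(examples x features)` matrices, the first principal component is compared across layers through its per-example scores (the leading left singular vector), since layers may differ in width, and full-batch linear CKA is used for brevity. All function names are hypothetical.

```python
import numpy as np


def _centered_svd(X):
    """Center features over examples and return the thin SVD."""
    Xc = X - X.mean(axis=0, keepdims=True)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc, U, S, Vt


def first_pc_variance_fraction(X):
    """Fraction of total variance explained by the first principal component."""
    _, _, S, _ = _centered_svd(X)
    return S[0] ** 2 / np.sum(S ** 2)


def first_pc_squared_cosine(X, Y):
    """Squared cosine similarity between the first-PC scores of two layers."""
    _, Ux, _, _ = _centered_svd(X)
    _, Uy, _, _ = _centered_svd(Y)
    # Left singular vectors have unit norm, so the dot product is the cosine.
    # Squaring removes the arbitrary sign of each singular vector.
    return float(Ux[:, 0] @ Uy[:, 0]) ** 2


def remove_first_pc(X):
    """Return centered activations with the first principal component subtracted."""
    Xc, U, S, Vt = _centered_svd(X)
    return Xc - S[0] * np.outer(U[:, 0], Vt[0])


def linear_cka(X, Y):
    """Full-batch linear CKA between two activation matrices."""
    Xc = X - X.mean(axis=0, keepdims=True)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Yc.T @ Xc, "fro") ** 2
    den = np.linalg.norm(Xc.T @ Xc, "fro") * np.linalg.norm(Yc.T @ Yc, "fro")
    return num / den


if __name__ == "__main__":
    # Toy "layers" of different widths that share one dominant direction
    # over examples (z), plus independent noise.
    rng = np.random.default_rng(0)
    z = rng.normal(size=200)
    layer_a = 20.0 * np.outer(z, rng.normal(size=10)) + rng.normal(size=(200, 10))
    layer_b = 20.0 * np.outer(z, rng.normal(size=20)) + rng.normal(size=(200, 20))
    print("variance explained by PC1:", first_pc_variance_fraction(layer_a))
    print("squared cosine of PC1 scores:", first_pc_squared_cosine(layer_a, layer_b))
    print("CKA:", linear_cka(layer_a, layer_b))
    print("CKA without PC1:",
          linear_cka(remove_first_pc(layer_a), remove_first_pc(layer_b)))
```

In this toy setup the two layers are similar only through the shared direction `z`, so removing the first principal component should sharply lower their CKA, mirroring the behaviour the caption reports for layers inside the block structure.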

Theorems & Definitions (2)

  • Proposition 1
  • proof