On the universality of neural encodings in CNNs

Florentin Guth, Brice Ménard

TL;DR

This work investigates whether CNNs trained on different image datasets converge to a universal neural encoding by shifting focus from representations to learned weights. It introduces a space–channel factorization and covariance-based alignment to compare weight encodings across networks, revealing a canonical set of universal spatial eigenvectors and, for natural images, a broadly shared channel-eigenvector structure across layers. The authors develop a framework using Procrustes alignment, eigenvalue shrinkage, and Bures–Wasserstein-based similarity to quantify encodings and demonstrate universality over diverse datasets and tasks, with true-label versus random-label training showing two distinct encoding regimes. The findings provide a principled basis for understanding transfer learning and foundation-model-style universality, suggesting that part of deep learning success stems from universal encodings that can be preset or shared across architectures, reducing the need to learn these components anew.
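
The Procrustes alignment mentioned above can be illustrated with the standard orthogonal Procrustes solution. The sketch below is not the authors' code: the function name `procrustes_rotation`, the synthetic activations, and the choice to align activation matrices with a single orthogonal matrix are assumptions made for illustration. It relies only on the textbook result that the orthogonal matrix minimizing $\|AQ - B\|_F$ is $Q = UV^\top$, where $U S V^\top$ is the SVD of $A^\top B$.

```python
import numpy as np

def procrustes_rotation(acts_a, acts_b):
    """Return the orthogonal Q minimizing ||acts_a @ Q - acts_b||_F.

    acts_a, acts_b: arrays of shape (n_samples, n_channels) holding the
    activations of two networks on the same inputs (hypothetical setup).
    """
    u, _, vt = np.linalg.svd(acts_a.T @ acts_b)
    return u @ vt

# Synthetic check: network A's channels are a rotated, slightly noisy copy
# of network B's channels.
rng = np.random.default_rng(0)
n_samples, n_channels = 500, 16
acts_b = rng.normal(size=(n_samples, n_channels))
true_q, _ = np.linalg.qr(rng.normal(size=(n_channels, n_channels)))
noise = 0.01 * rng.normal(size=(n_samples, n_channels))
acts_a = acts_b @ true_q.T + noise

q = procrustes_rotation(acts_a, acts_b)
relative_error = np.linalg.norm(acts_a @ q - acts_b) / np.linalg.norm(acts_b)
```

Once such a rotation is estimated, quantities expressed in one network's channel basis (for example weight covariance eigenvectors) can be rotated into the other network's basis before they are compared.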

Abstract

We explore the universality of neural encodings in convolutional neural networks trained on image classification tasks. We develop a procedure to directly compare the learned weights rather than their representations. It is based on a factorization of spatial and channel dimensions and measures the similarity of aligned weight covariances. We show that, for a range of layers of VGG-type networks, the learned eigenvectors appear to be universal across different natural image datasets. Our results suggest the existence of a universal neural encoding for natural images. They explain, at a more fundamental level, the success of transfer learning. Our work shows that, instead of aiming at maximizing the performance of neural networks, one can alternatively attempt to maximize the universality of the learned encoding, in order to build a principled foundation model.
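
To make the idea of spatial eigenvectors concrete, here is a minimal sketch of one plausible reading of the spatial part of the factorization: every (output channel, input channel) pair of a convolutional weight tensor is treated as one sample of a $k \times k$ filter, and the resulting $k^2 \times k^2$ second-moment matrix is diagonalized. The function name `spatial_eigenvectors`, the uncentered covariance, and the synthetic weights are assumptions; the paper's exact definition is not given in this excerpt.

```python
import numpy as np

def spatial_eigenvectors(weight):
    """Eigen-decompose the spatial second-moment matrix of a conv weight.

    weight: array of shape (c_out, c_in, k_h, k_w).
    Returns (eigenvalues, eigenvectors) sorted by decreasing eigenvalue;
    eigenvectors[:, i].reshape(k_h, k_w) is the i-th spatial pattern.
    """
    c_out, c_in, k_h, k_w = weight.shape
    # One row per (output, input) channel pair, one column per pixel.
    filters = weight.reshape(c_out * c_in, k_h * k_w)
    # Uncentered covariance over channel pairs (an assumption of this sketch).
    cov = filters.T @ filters / filters.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

# Synthetic weights: a single dominant spatial pattern plus small noise.
rng = np.random.default_rng(0)
pattern = np.ones((3, 3)) / 3.0  # unit-norm averaging filter
coeffs = rng.normal(size=(64, 32, 1, 1))
weight = 5.0 * coeffs * pattern + 0.1 * rng.normal(size=(64, 32, 3, 3))

eigvals, eigvecs = spatial_eigenvectors(weight)
leading = eigvecs[:, 0].reshape(3, 3)
overlap = abs(float(np.sum(leading * pattern)))
```

On these synthetic weights the leading eigenvector recovers the planted pattern up to sign. For a trained network, `weight` would instead be the kernel of a convolutional layer.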

Paper Structure (21 sections, 6 equations, 5 figures)

Figures (5)

  • Figure 1: Visualization of spatial filter eigenvectors learned by VGG networks in various settings. (a) Learned spatial eigenvectors on ImageNet for different filter sizes. For larger filter sizes, only the first $9$ eigenvectors are shown. The same set of eigenvector patterns can be seen for all layers $l\ge 2$, while the first layer (shown separately) behaves differently. Occasionally, consecutive ranks appear in flipped order because their eigenvalues are similar, as indicated with arrows. (b) When training with random labels on ImageNet, we recover the same set of filters, slightly distorted. The spatial eigenvectors can thus be mostly learned without labels. (c) Spatial eigenvectors for VGG networks trained on different datasets. The ImageNet dataset was resized to the $32\times 32$ resolution for direct comparison with the other datasets. The learned spatial eigenvectors are similar across datasets (and depth), and are similar to the filters learned on higher-resolution images. Note that only the first $6$ convolutional layers are relevant given the smaller size of the images.
  • Figure 2: A schematic view of the first two layers of two networks trained from different random initializations. At layer one, the weight eigenvectors are aligned by default as the input originates from the aligned image pixels. Collectively, the neurons act as an operator filtering certain directions of variation. Individually, each neuron defines an axis for the next layer. Expressed in this random basis, layer-two weight eigenvectors are no longer aligned between the two networks. An activation-based representation alignment can be used to meaningfully rotate one basis onto the other. The middle panels show the cosine similarities between weight covariance eigenvectors of the two networks trained on CIFAR10, with and without activation-based alignment. The bottom-right panel shows that almost all the performance originates from the range of ranks for which the correlation is detected.
  • Figure 3: Universality of the leading covariance eigenvectors. Each panel shows pairwise cosine similarities between the first weight covariance eigenvectors as a function of rank, for VGG networks with frozen spatial filters trained on different classification tasks. The colormap range is defined in terms of the expected level of correlation between two random vectors (which is $1/\sqrt d$), so that this base level corresponds to the color white and statistically significant correlations ($5$ times this base level) correspond to the color black. The axis arrows indicate the effective rank of the corresponding weight covariance spectra. They show how the dimensionality of the learned subspaces increases with depth. Top: Networks are trained on ImageNet at $32 \times 32$ resolution with different labels. The first row indicates the similarity between learned encodings when only the random initialization is changed (indicated by the suffix "bis"). The subsequent rows show that the channel encodings emerging during training with true versus random labels are fundamentally different. However, the channel eigenvectors learned on different realizations of random labels are similar to each other, suggesting a consistent encoding strategy for random label tasks. Bottom: Networks are trained on various subsets of the CIFAR10, CIFAR100 and ImageNet datasets. The panels show that similar weight eigenvectors are consistently learned across datasets for a range of layers.
  • Figure 4: Normalized covariance similarities (eq. \ref{eq:normalized_cos_sim}) for pairs of training tasks as a function of depth. The left panel confirms that the network learns different encodings when trained on true and random labels, but that the random-label encoding does not depend on the realization of the random labels. The right panel quantifies the level of universality observed at each layer.
  • Figure 5: Visualization of the spatial filters learned by a VGG-11 network with $8$ convolutional layers and filters of size $7 \times 7$. At each layer, we only show the filters corresponding to the first $16$ input and output channels.
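
The Figure 3 caption calibrates its colormap against the expected correlation between two random vectors, $1/\sqrt d$. This baseline can be checked numerically: for independent isotropic Gaussian vectors in dimension $d$, the squared cosine similarity has mean exactly $1/d$, so its root mean square is $1/\sqrt d$. The sketch below is illustrative only; the function name `rms_cosine_similarity`, the sample sizes, and the dimensions are assumptions.

```python
import numpy as np

def rms_cosine_similarity(d, n_pairs=2000, seed=0):
    """Root-mean-square cosine similarity between random Gaussian vectors."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(n_pairs, d))
    v = rng.normal(size=(n_pairs, d))
    norms = np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    cos = np.sum(u * v, axis=1) / norms
    return float(np.sqrt(np.mean(cos ** 2)))

# Ratio of the empirical RMS to the 1/sqrt(d) baseline; should be close to 1.
ratios = {d: rms_cosine_similarity(d) * np.sqrt(d) for d in (64, 256, 512)}

# Threshold used in the caption for a significant correlation: 5x the baseline.
significance_threshold = {d: 5.0 / np.sqrt(d) for d in ratios}
```

Because the baseline shrinks as $1/\sqrt d$, a fixed cosine similarity is more significant in wider layers, which is why the colormap range is tied to $d$ rather than being fixed.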