Universal dimensions of visual representation

Zirui Chen, Michael F. Bonner

TL;DR

The paper demonstrates that deep vision systems converge on a small set of universal, brain-aligned representational dimensions that generalize across architectures, initializations, and tasks. By analyzing hundreds of thousands of principal components from diverse networks and comparing them to fMRI data from the Natural Scenes Dataset, the authors show that a subset of dimensions is consistently learned and highly predictive of human visual representations. Reducing networks to just their top universal dimensions preserves or enhances representational similarity with the visual cortex, revealing that high-level semantic structure is encoded in universal subspaces. These findings suggest that universal image representations underpin both artificial and biological vision, with implications for initialization, data efficiency, and cross-species theories of vision.

Abstract

Do neural network models of vision learn brain-aligned representations because they share architectural constraints and task objectives with biological vision or because they learn universal features of natural image processing? We characterized the universality of hundreds of thousands of representational dimensions from visual neural networks with varied construction. We found that networks with varied architectures and task objectives learn to represent natural images using a shared set of latent dimensions, despite appearing highly distinct at a surface level. Next, by comparing these networks with human brain representations measured with fMRI, we found that the most brain-aligned representations in neural networks are those that are universal and independent of a network's specific characteristics. Remarkably, each network can be reduced to fewer than ten of its most universal dimensions with little impact on its representational similarity to the human brain. These results suggest that the underlying similarities between artificial and biological vision are primarily governed by a core set of universal image representations that are convergently learned by diverse systems.

Paper Structure (29 sections, 2 equations, 17 figures, 3 tables)


Figures (17)

  • Figure 1: Overview of method for computing universality and brain similarity of network dimensions. (a) Four sets of deep neural networks were analyzed, including three sets of trained models that varied in either their random initializations, architectures, or task objectives and one set of untrained models with different initializations. (b) Universality and brain similarity were defined as the average prediction accuracy of a latent dimension from a target network when using the activations of other networks or the fMRI activations of the human brain as predictors. Dimensions that can be consistently predicted from the representations of other networks have high universality. Dimensions that can be consistently predicted from the representations of the human brain have high brain similarity.
  • Figure 2: Universality and brain similarity of network dimensions. Universality and brain similarity were computed for representational dimensions in four sets of deep neural networks. These included three sets of trained networks with varied initializations, architectures, and task objectives and one set of untrained networks. These metrics were computed for the principal components of network activations extracted from the sampled layers of each network. Universality scores reflect the degree to which a representational dimension is shared across all networks in a set, and brain similarity scores reflect the degree to which a representational dimension is shared with the human visual system. Measurements of human visual cortex activity were obtained from the Natural Scenes fMRI Dataset using a general region of interest that included all visually responsive voxels (Allen et al., 2022). Universality and brain similarity scores are plotted for all analyzed network dimensions. These plots show the density of dimensions on a logarithmic scale, with densities computed using kernel density estimation. The orange dots show the mean universality and brain similarity scores for equally sized quantiles of 100 dimensions along the x-axis. These plots show similar trends for all three sets of trained models (the first three plots on the left). Specifically, they exhibit a high density of points near the origin, showing that most dimensions are idiosyncratic to each network and are not shared with the human brain. However, they also contain a subset of dimensions with exceptionally high universality and brain similarity scores. These latter dimensions correspond to representations that are consistently learned by all networks within a set and are also strongly shared with the visual representations of the human brain. In contrast, untrained networks (right panel) can also have shared dimensions, but these shared untrained dimensions have relatively weak brain similarity scores.
  • Figure 3: Universality and brain similarity across network layers. These plots show the universality and brain similarity scores for individual network layers spanning the full depth of each network. Four sets of deep neural networks were examined, including three sets of trained networks with varied initializations, architectures, and tasks and one set of untrained networks. The analyses are the same as in Figure 2, but here the results are plotted as the average values for individual layers, which are labeled according to their relative depth. Further details of these network layers are included in the supplementary file at https://github.com/zche377/universal_dimensions/blob/main/src/lib/models/model_layers.csv. Average universality and brain similarity scores were computed for equally sized quantiles of 100 dimensions along the x-axis for each layer. Panels on the sides of each plot show the density of dimensions on a logarithmic scale computed using kernel density estimation. As in Figure 2, these plots exhibit a high density of points near the origin, which means that across all sampled layers, most dimensions are idiosyncratic and are not shared with the human brain. However, the three sets of trained networks (first three plots on the left) also contain a subset of dimensions at the right end of each plot that have exceptionally high universality and brain similarity scores. Importantly, these layer-wise plots show that a consistent trend is observed across all sampled layers and that universal dimensions are not restricted to early network layers.
  • Figure 4: Two-dimensional visualization of high-level universal representations. Image activations for the 100 most universal dimensions from a high-level network layer were embedded in two dimensions using uniform manifold approximation and projection (UMAP). Specifically, image activations were obtained for the top 100 dimensions with the highest universality scores in the penultimate layer from the set of models trained on different tasks. For visualization purposes, this figure only includes images shown to a single subject. This plot shows that universal dimensions do not simply reflect low-level image features but instead capture high-level properties that group images into semantically related clusters, some of which are highlighted here, including animals, food, sports, and people. In contrast, Supplementary Figure 5 shows a visualization of the 100 least universal dimensions from the same network layer, and it shows no clear semantic organization.
  • Figure 5: Universal dimensions underlie the results of conventional representational similarity analyses. Representational similarity analysis (RSA) was used to compare the representations of neural networks and visual cortex. These analyses were performed using the same general region of interest in visual cortex and the same sets of neural networks as in Figures 2 and 3. Representational dissimilarity matrices (RDMs) were created by calculating Pearson correlation distances for pairwise comparisons of image representations within each network and each fMRI subject. RSA scores were obtained by calculating the Spearman correlation between the RDMs for a network and an fMRI subject. These RSA scores were averaged across subjects. For each network, the best-performing layer was selected using a set of training data, and a final RSA score was computed on held-out test data. These analyses show the results of RSA for networks whose representations were either intact or reduced to subspaces of their top ten or five universal dimensions. In these plots, each dot is a network, and lines connect different versions of a network containing either all, ten, or five dimensions. The violin plots show distributions of RSA scores across networks. Even after drastically reducing the networks to just ten or five universal dimensions, the RSA scores exhibit little or no decrease; in fact, for all three sets of trained networks, the RSA scores slightly improve. These results demonstrate that conventional measures of representational similarity between neural networks and visual cortex are largely driven by the subspaces of universal dimensions contained within each network. Similar trends were observed within each individual subject and in other regions of interest, as shown in Supplementary Figures 7 and 8.
  • ...and 12 more figures
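
The two measures described in the captions for Figures 1 and 5 can be illustrated with a minimal Python sketch on synthetic data. This is not the authors' implementation. The captions only state that universality and brain similarity are average prediction accuracies and that RSA uses Pearson correlation distances and a Spearman correlation. Everything else here is an assumption made for illustration: the use of ridge regression as the predictor, Pearson correlation on held-out images as the accuracy measure, a single train/test split, comparing only the upper triangle of each RDM, and the hypothetical function names `ridge_predict`, `mean_predictability`, `rdm`, and `rsa_score`.

```python
import numpy as np


def ridge_predict(x_train, y_train, x_test, alpha=1.0):
    """Fit closed-form ridge regression and predict y for the test images."""
    x_mean, y_mean = x_train.mean(axis=0), y_train.mean()
    xc = x_train - x_mean
    # Ridge solution: (X^T X + alpha * I)^-1 X^T y on centered data.
    gram = xc.T @ xc + alpha * np.eye(xc.shape[1])
    weights = np.linalg.solve(gram, xc.T @ (y_train - y_mean))
    return (x_test - x_mean) @ weights + y_mean


def mean_predictability(target_dim, predictors, n_train, alpha=1.0):
    """Average held-out prediction accuracy of one target dimension.

    With other networks as predictors this plays the role of universality;
    with fMRI responses as predictors it plays the role of brain similarity.
    Accuracy is assumed to be the Pearson correlation on held-out images.
    """
    scores = []
    for acts in predictors:  # each entry: (n_images, n_units) array
        pred = ridge_predict(acts[:n_train], target_dim[:n_train],
                             acts[n_train:], alpha)
        scores.append(np.corrcoef(pred, target_dim[n_train:])[0, 1])
    return float(np.mean(scores))


def rdm(acts):
    """Pearson correlation distance between every pair of image rows."""
    return 1.0 - np.corrcoef(acts)


def rsa_score(acts_a, acts_b):
    """Spearman correlation between the upper triangles of two RDMs."""
    upper = np.triu_indices(acts_a.shape[0], k=1)
    # Rank transform (assumes no ties, which holds for continuous values).
    rank_a = np.argsort(np.argsort(rdm(acts_a)[upper])).astype(float)
    rank_b = np.argsort(np.argsort(rdm(acts_b)[upper])).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


# Synthetic demo: three "networks" share a 5-dimensional latent code.
rng = np.random.default_rng(0)
n_images, n_train = 200, 100
latent = rng.standard_normal((n_images, 5))
other_networks = [
    latent @ rng.standard_normal((5, 20))
    + 0.1 * rng.standard_normal((n_images, 20))
    for _ in range(3)
]

# A dimension built from the shared latent code vs. one made of pure noise.
shared_dim = latent[:, 0] + 0.1 * rng.standard_normal(n_images)
idiosyncratic_dim = rng.standard_normal(n_images)

shared_score = mean_predictability(shared_dim, other_networks, n_train)
idiosyncratic_score = mean_predictability(idiosyncratic_dim, other_networks, n_train)
self_rsa = rsa_score(other_networks[0], other_networks[0])
cross_rsa = rsa_score(other_networks[0], other_networks[1])

print("shared dimension:", shared_score)
print("idiosyncratic dimension:", idiosyncratic_score)
print("RSA between two synthetic networks:", cross_rsa)
```

In this toy setting, the dimension derived from the shared latent code is predicted far better from the other networks than the noise dimension, which mirrors the contrast the captions draw between universal and idiosyncratic dimensions. Passing per-subject fMRI response matrices as `predictors` would give the analogous brain similarity score under the same assumptions.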