Universal dimensions of visual representation
Zirui Chen, Michael F. Bonner
TL;DR
The paper demonstrates that deep vision systems converge on a small set of universal, brain-aligned representational dimensions that generalize across architectures, initializations, and tasks. By analyzing hundreds of thousands of principal components from diverse networks and comparing them to fMRI data from the Natural Scenes Dataset, the authors show that a subset of dimensions is consistently learned and highly predictive of human visual representations. Reducing networks to just their top universal dimensions preserves or enhances representational similarity with the visual cortex, revealing that high-level semantic structure is encoded in universal subspaces. These findings suggest that universal image representations underpin both artificial and biological vision, with implications for initialization, data efficiency, and cross-species theories of vision.
Abstract
Do neural network models of vision learn brain-aligned representations because they share architectural constraints and task objectives with biological vision or because they learn universal features of natural image processing? We characterized the universality of hundreds of thousands of representational dimensions from visual neural networks with varied construction. We found that networks with varied architectures and task objectives learn to represent natural images using a shared set of latent dimensions, despite appearing highly distinct at a surface level. Next, by comparing these networks with human brain representations measured with fMRI, we found that the most brain-aligned representations in neural networks are those that are universal and independent of a network's specific characteristics. Remarkably, each network can be reduced to fewer than ten of its most universal dimensions with little impact on its representational similarity to the human brain. These results suggest that the underlying similarities between artificial and biological vision are primarily governed by a core set of universal image representations that are convergently learned by diverse systems.
