Saliency Suppressed, Semantics Surfaced: Visual Transformations in Neural Networks and the Brain
Gustaw Opiełka, Jessica Loke, Steven Scholte
TL;DR
The paper investigates how neural networks transform visual input into semantic representations and how these transformations compare to human vision. Using Representational Similarity Analysis (RSA), it quantifies the alignment between network activations, low-level saliency, and high-level semantic embeddings, and it introduces a distractor-based dataset to causally test the effects of saliency and semantics. CLIP training enhances semantic encoding in both ResNets and Vision Transformers (ViTs) and induces saliency suppression in early ResNet layers. Semantic representations align with human visual processing in higher visual areas, whereas saliency suppression appears to be a network-specific strategy that is not mirrored in the brain. The NSD-based brain analyses reveal strong semantics-brain alignment (e.g., r≈0.83, p<0.001) and a significant but nonlinear saliency contribution (r≈0.64, p<0.001). Together, these results mark both the convergence and the boundaries between artificial and biological visual transformations, and they point toward more brain-aligned AI systems.
Abstract
Deep learning algorithms lack human-interpretable accounts of how they transform raw visual input into a robust semantic understanding, which impedes comparisons between different architectures, training objectives, and the human brain. In this work, we take inspiration from neuroscience and employ representational approaches to shed light on how neural networks encode information at low (visual saliency) and high (semantic similarity) levels of abstraction. Moreover, we introduce a custom image dataset in which we systematically manipulate salient and semantic information. We find that, when trained with object classification objectives, ResNets are more sensitive to saliency information than ViTs. We also find that networks suppress saliency in early layers, a process that natural language supervision (CLIP) enhances in ResNets. CLIP also enhances semantic encoding in both architectures. Finally, we show that semantic encoding is a key factor in aligning AI with human visual perception, while saliency suppression is a non-brain-like strategy.
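To make the representational approach concrete, the following is a minimal sketch of Representational Similarity Analysis on toy data. It is not the authors' pipeline: the choice of correlation distance (1 - Pearson r) for the dissimilarity matrices, Spearman correlation for comparing them, the random toy features, and all function names (`pearson`, `ranks`, `rdm_upper`, `rsa_score`) are assumptions made for illustration only.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def rdm_upper(features):
    """Upper triangle of the dissimilarity matrix: 1 - Pearson r per image pair."""
    n = len(features)
    return [1.0 - pearson(features[i], features[j])
            for i in range(n) for j in range(i + 1, n)]


def rsa_score(features_a, features_b):
    """Spearman correlation (Pearson on ranks) between two dissimilarity matrices."""
    return pearson(ranks(rdm_upper(features_a)), ranks(rdm_upper(features_b)))


# Hypothetical toy data: 6 "images", each described by 20 features.
rng = random.Random(0)
layer = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(6)]
# An affine rescaling of the features leaves correlation distances unchanged.
rescaled = [[2.0 * v + 1.0 for v in row] for row in layer]
# Independent random features share no representational geometry with `layer`.
unrelated = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(6)]

score_same = rsa_score(layer, rescaled)
score_other = rsa_score(layer, unrelated)
print(score_same, score_other)
```

In an analysis of the kind described here, `features_a` would hold a layer's activations for a set of images and `features_b` a saliency map, a semantic embedding, or brain responses to the same images. Because the comparison happens between dissimilarity matrices rather than raw features, the two feature spaces do not need to share a dimensionality.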
