Saliency Suppressed, Semantics Surfaced: Visual Transformations in Neural Networks and the Brain

Gustaw Opiełka, Jessica Loke, Steven Scholte

TL;DR

The paper investigates how neural networks transform visual input into semantic representations and how these transformations compare to human vision. Using Representational Similarity Analysis (RSA), it quantifies alignment between network activations, low-level saliency, and high-level semantic embeddings, and it introduces a distractor-based dataset to causally test the effects of saliency and semantics. The key findings are that CLIP training enhances semantic encoding in both ResNets and Vision Transformers (ViTs) and induces saliency suppression in early ResNet layers. Semantic representations align with human visual processing in higher visual areas, while saliency suppression appears to be a network-specific strategy not mirrored in the brain. The brain analyses, based on the Natural Scenes Dataset (NSD), reveal strong semantics-brain alignment (e.g., r≈0.83, p<0.001) and a significant but nonlinear saliency contribution (r≈0.64, p<0.001). Together, these results mark both the convergence and the boundaries between artificial and biological visual transformations and point toward more brain-aligned AI systems.

Abstract

Deep learning algorithms lack human-interpretable accounts of how they transform raw visual input into a robust semantic understanding, which impedes comparisons between different architectures, training objectives, and the human brain. In this work, we take inspiration from neuroscience and employ representational approaches to shed light on how neural networks encode information at low (visual saliency) and high (semantic similarity) levels of abstraction. Moreover, we introduce a custom image dataset where we systematically manipulate salient and semantic information. We find that ResNets are more sensitive to saliency information than ViTs, when trained with object classification objectives. We uncover that networks suppress saliency in early layers, a process enhanced by natural language supervision (CLIP) in ResNets. CLIP also enhances semantic encoding in both architectures. Finally, we show that semantic encoding is a key factor in aligning AI with human visual perception, while saliency suppression is a non-brain-like strategy.

Paper Structure (19 sections, 5 equations, 12 figures, 1 table)

Figures (12)

  • Figure 1: Experimental approach. (a.) Calculating the alignment between network features and visual saliency/semantics. Saliency maps and caption embeddings from the COCO images were converted into RDMs. Their correlation with network feature RDMs establishes the degree of alignment: Saliency/Semantics RSA. (b.) Sensitivity to distractors. Network features were extracted from images with all 4 distractor types (see Figure \ref{fig:distractors}). These RDMs were correlated with the original saliency and semantic RDMs (from a) to establish the RSA that results from seeing the distractors. Taking the absolute difference with the baseline RSA (from a), we get $\Delta$RSA Saliency/Semantics, which measures network alignment to the original low- and high-level image content amidst distractors.
  • Figure 2: Representation of semantic information generally increases as a function of layer depth in all networks. CLIP enhances the amount of semantic information represented by the networks, mostly in later layers. Notably, however, we see a negative alignment between layer representations and saliency maps in ResNets trained with CLIP, suggesting saliency suppression. Architecturally, ResNets encode more saliency information than ViTs (p < .001). Note: for visualization purposes, the values are averaged across several layers. To view the raw values, see Appendix \ref{raw_rsa}.
  • Figure 3: Visual and semantic distractors. (a.) Left: For the central image (target), the top images have dissimilar captions and the bottom ones are similar. Right: Saliency maps of the target image and images with distractors. Top images alter the target's salient features more than the bottom ones. Red outlines indicate distractor locations. Distributions on the left and right illustrate semantic similarity and visual saliency thresholds, respectively. (b.) Images with four distractor types, with numbers corresponding to saliency maps in a.
  • Figure 4: $\Delta$RSA Saliency. Higher $\Delta$RSA values indicate greater disruption in network saliency representations as a result of 4 different distractors (Control, Salient, Semantic, and Semantic & Salient). We see an effect of training objective in ResNets: salient distractors cause more disruption in later layers of ImageNet-trained ResNets compared to those trained with CLIP. We also see an architectural difference: smaller ViTs (see Figure \ref{fig:ViT-delta_appendix} to view ViT-B16) show less disruption than ResNets to salient distractors. In the largest CLIP-trained ViT, there is an interaction effect of saliency & semantics.
  • Figure 6: Saliency and semantic representation in the brain. Similar to neural networks (as shown in Figure \ref{fig:baseline}), the deeper into the visual processing hierarchy, the more the neural representations align with semantics. In contrast, saliency exhibits a different pattern: it is most pronounced in the early cortex and diminishes in higher processing stages. Notably, unlike in neural networks, we do not observe a negative alignment with saliency.
  • ...and 7 more figures
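
The RSA and $\Delta$RSA procedure described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' implementation: the function names (`rdm`, `rsa`, `delta_rsa`), the array shapes, the use of 1 − Pearson correlation as the dissimilarity measure, and the use of Pearson correlation between RDM upper triangles are all assumptions made for the example. The caption itself only states that RDMs are correlated and that $\Delta$RSA is an absolute difference from the baseline.

```python
import numpy as np


def rdm(features):
    """Build a representational dissimilarity matrix (RDM).

    Each row of `features` is one image's representation. Dissimilarity is
    1 - Pearson correlation between rows (an assumed choice of metric).
    """
    features = np.asarray(features, dtype=float)
    return 1.0 - np.corrcoef(features)


def rsa(rdm_a, rdm_b):
    """Correlate the upper triangles (diagonal excluded) of two RDMs."""
    upper = np.triu_indices(rdm_a.shape[0], k=1)
    return float(np.corrcoef(rdm_a[upper], rdm_b[upper])[0, 1])


def delta_rsa(baseline_rsa, distractor_rsa):
    """Absolute difference between distractor RSA and baseline RSA."""
    return abs(distractor_rsa - baseline_rsa)


# Synthetic stand-ins (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
n_images = 20
saliency_maps = rng.random((n_images, 64))  # flattened saliency maps

# Layer activations for clean images: a noisy linear readout of saliency.
projection = rng.normal(size=(64, 128))
layer_clean = saliency_maps @ projection + 0.1 * rng.normal(size=(n_images, 128))

# Layer activations for the same images with distractors: extra perturbation.
layer_distracted = layer_clean + 5.0 * rng.normal(size=(n_images, 128))

saliency_rdm = rdm(saliency_maps)
baseline = rsa(rdm(layer_clean), saliency_rdm)          # panel (a): baseline RSA
distracted = rsa(rdm(layer_distracted), saliency_rdm)   # panel (b): RSA under distractors
disruption = delta_rsa(baseline, distracted)

print(f"baseline RSA:   {baseline:.3f}")
print(f"distractor RSA: {distracted:.3f}")
print(f"delta RSA:      {disruption:.3f}")
```

The same three functions would apply to the semantic case by swapping the saliency maps for caption embeddings when building the reference RDM. Rank-based correlation is a common alternative for comparing RDMs and could replace the Pearson step in `rsa`.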