Shared representations in brains and models reveal a two-route cortical organization during scene perception

Pablo Marcos-Manchón, Lluís Fuentemilla

TL;DR

These findings refine classical visual-stream models by characterizing scene perception as a distributed cortical network with separable representational routes for context and animate content.

Abstract

The brain transforms visual inputs into high-dimensional cortical representations that support diverse cognitive and behavioral goals. Characterizing how this information is organized and routed across the human brain is essential for understanding how we process complex visual scenes. Here, we applied representational similarity analysis to 7T fMRI data collected during natural scene viewing. We quantified representational geometry shared across individuals and compared it to hierarchical features from vision and language neural networks. This analysis revealed two distinct processing routes: a ventromedial pathway specialized for scene layout and environmental context, and a lateral occipitotemporal pathway selective for animate content. Vision models aligned with shared structure in both routes, whereas language models corresponded primarily with the lateral pathway. These findings refine classical visual-stream models by characterizing scene perception as a distributed cortical network with separable representational routes for context and animate content.
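
The RSA step described in the abstract can be illustrated with a minimal NumPy sketch on synthetic data. It assumes correlation distance for the RDMs and a Pearson correlation between their upper triangles; the distance metric, the array shapes, and the names `rdm`, `rdm_alignment`, and `simulate` are illustrative assumptions, not the authors' code.

```python
import numpy as np

def rdm(responses):
    """Representational dissimilarity matrix (assumed metric: 1 - Pearson r).

    responses: array of shape (n_stimuli, n_features), e.g. vertices in a
    cortical parcel or units in a model layer.
    """
    return 1.0 - np.corrcoef(responses)

def rdm_alignment(rdm_a, rdm_b):
    """Pearson correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices_from(rdm_a, k=1)  # skip the diagonal and duplicates
    return float(np.corrcoef(rdm_a[iu], rdm_b[iu])[0, 1])

rng = np.random.default_rng(0)
n_stimuli, n_latent = 40, 5
# Synthetic stimulus structure shared by every simulated system.
latent = rng.normal(size=(n_stimuli, n_latent))

def simulate(n_features, noise):
    """Project the shared structure into a new feature space and add noise."""
    weights = rng.normal(size=(n_latent, n_features))
    return latent @ weights + noise * rng.normal(size=(n_stimuli, n_features))

subject_a = simulate(200, noise=1.0)    # hypothetical parcel, participant A
subject_b = simulate(150, noise=1.0)    # same parcel, participant B
model_layer = simulate(64, noise=0.5)   # hypothetical model-layer activations
unrelated = rng.normal(size=(n_stimuli, 100))  # no shared structure

is_rsa = rdm_alignment(rdm(subject_a), rdm(subject_b))         # inter-subject RSA
brain_model = rdm_alignment(rdm(subject_a), rdm(model_layer))  # brain-model RSA
baseline = rdm_alignment(rdm(subject_a), rdm(unrelated))

print(f"IS-RSA: {is_rsa:.2f}, brain-model: {brain_model:.2f}, unrelated: {baseline:.2f}")
```

Because RDMs are compared rather than raw responses, the two systems can have different numbers of features (vertices, units) as long as they respond to the same stimulus set.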

Paper Structure

This paper contains 26 sections, 19 equations, 26 figures, 1 table.

Figures (26)

  • Figure 1: Overview of the analysis pipeline. (A) Feature extraction. For each image stimulus, we extracted corresponding representations from brain activity and deep neural networks. Single-trial fMRI responses were aggregated within cortical parcels to create vector representations of the brain's response. Concurrently, layer-wise activations were extracted from pre-trained vision and language models to obtain representations across the full model hierarchies. Both brain and model vectors were used to compute representational dissimilarity matrices (RDMs). Example input image adapted from Wikimedia (Bengt Nyman, CC BY 2.0). (B) Representational alignment. Representational Similarity Analysis (RSA) was used in two ways: (i) inter-subject RSA (IS-RSA), correlating parcel-wise RDMs across participants to estimate shared representational geometry; and (ii) brain–model RSA, correlating parcel RDMs with model-layer RDMs to quantify brain–model and layer-wise alignment profiles. (C) Representational connectivity. IS-RSA between pairs of parcels was used to construct a cortical network based on how similarly regions encode the stimulus set. The directionality of information flow was inferred from the peak model-layer alignment established in (B). (D) Shared dimensions. Within the main hubs identified in (C), Kernel Multi-view Canonical Correlation Analysis (KMCCA) was used to decompose the shared geometry into latent dimensions common across participants and to relate these dimensions to coarse scene features that drive alignment in different parts of the network.
  • Figure 1: Detailed parcel-level alignment for all modalities. Box plots show representational alignment scores computed for each of the 180 cortical parcels of the symmetric HCP atlas (Glasser et al., 2016), organized by macro-anatomical groups (Huang et al., 2022). Each box represents the distribution of alignment scores across the eight participants ($N=8$). The red line and shaded area in each panel denote the mean and standard deviation of the null distribution, respectively, estimated via permutation testing. (A) Inter-subject alignment (IS-RSA). (B) Brain-to-vision-model alignment. (C) Brain-to-language-model alignment. This figure provides a detailed view of the results summarized in the main text, showing the parcel-by-parcel variability and confirming the concentration of high alignment within the Early Visual, Ventral, and LOTC hubs.
  • Figure 2: Inter-subject and model–brain alignment across cortex. (A) Inter-subject representational alignment (IS-RSA; Pearson’s $r$ between parcel-wise RDMs across participants) for parcels drawn from the macro-anatomical clusters with the highest mean IS-RSA (color legend). Boxplots show the distribution across subjects in the NSD sample ($N=8$, symmetric HCP–MMP atlas). Red lines indicate the parcel-wise null distribution (mean $\pm$ s.d.; 10,000 permutations). Unless otherwise annotated in the plot, parcels are significant at *** ($p<0.001$; two-tailed, FDR-corrected); additional annotations indicate ** ($p<0.01$), * ($p<0.05$), or n.s. ($p\ge 0.05$). Peak alignment occurs in early visual cortex (V1–V4), a ventral hub (VMV1–3, PHA1–3), and a lateral occipitotemporal (LOTC) hub (V4t, MT, MST, FST, TPOJ2–3). Full-atlas results are shown in Supplementary Fig. \ref{fig:extended_spatial_distribution}. (B–D) Cortical surface maps of alignment, averaged across subjects (and across models within each modality): (B) inter-subject IS-RSA; (C) vision–model RSA (maximum across layers, averaged across vision models); (D) language–model RSA (maximum across layers, averaged across language models). (E–F) Parcel-wise relationship between IS-RSA and model–brain alignment for (E) vision models and (F) language models. Points are colored by macro-anatomical group (as in panel A); other parcels are shown in white. Vision parcels follow an approximate power-law fit ($R^{2}=0.94$, shaded band: 95% bootstrap CI), whereas language-model alignment clusters near zero or negative except for parcels in and around the LOTC hub. (G) Modality comparison within the three hubs. Boxplots show hub alignment for vision and language models in early visual cortex, the ventral hub, and the LOTC hub (averaged across models). Paired two-tailed $t$-tests ($t(7)$, FDR-corrected) indicate stronger vision-model alignment in early visual and ventral hubs, and stronger language-model alignment in the LOTC hub.
  • Figure 2: Cortical flat map projections of representational alignment. To show the full spatial distribution of alignment, group-level scores are projected onto a flattened cortical surface. The figure shows (A) IS-RSA alignment, (B) vision-model alignment, and (C) language-model alignment. Values represent group-averaged RSA scores (Pearson's $r$, $N=8$ from the NSD dataset), and major sulci are labeled for anatomical orientation. This visualization shows the full extent of the alignment patterns, in particular the widespread negative alignment (cool colors) between language models and the ventral visual stream, in contrast to the positive alignment seen for inter-subject and vision-model comparisons.
  • Figure 3: Hierarchical correspondence between model layers and cortical regions. (A–C) Layer-wise alignment (RSA) between vision models and representative parcels in the three hubs, averaged across all vision models. Panels show early visual cortex (A), the Ventral hub (B), and the LOTC hub (C). The $x$-axis expresses depth as a percentage of the total number of layers in each vision model. Curves show the mean alignment across participants ($N=8$); shaded bands indicate the standard error of the mean (SEM). Early visual parcels peak in shallow layers, Ventral parcels show distributed alignment across the hierarchy, and LOTC parcels peak in the deepest layers. (D) Cortical surface map of the vision-model layer that yields the highest alignment for each parcel (averaged across vision models). Colors encode normalized peak-layer depth (0% = shallowest layer, 100% = deepest layer), revealing a posterior–anterior gradient from shallow (green) to deep (purple) alignment. (E) Layer-wise alignment between language models and representative parcels from the three hubs (mean across language models). Alignment values are negligible in early visual and Ventral parcels and are restricted to LOTC, where curves rise quickly and then form a plateau rather than a smooth hierarchical progression. (F) Distribution across participants ($N=8$) of peak alignment depths for the 20 parcels with the highest overall vision–model alignment. For each subject and parcel, peak depth is defined as the mean depth (across vision models) of the layer at which RSA is maximal. Boxplots are ordered by median peak depth and colored by macro-anatomical cluster.
  • ...and 21 more figures
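
Two quantities named in the captions above, the permutation null for an RSA score and the normalized peak-layer depth, can be sketched as follows. This is a hedged illustration on synthetic data: the shuffling scheme (jointly permuting the stimulus order of one RDM), the +1 correction in the p-value, the depth formula, and all function names are assumptions, since the captions do not specify them. The demo uses 200 permutations for speed, whereas the caption reports 10,000.

```python
import numpy as np

def upper(rdm):
    """Upper-triangle entries of a square RDM, excluding the diagonal."""
    iu = np.triu_indices_from(rdm, k=1)
    return rdm[iu]

def permutation_null(rdm_a, rdm_b, n_perm=1000, seed=0):
    """Null RSA scores obtained by shuffling the stimulus order of rdm_b.

    Rows and columns are permuted together so each shuffled matrix is still
    a valid RDM (assumed scheme, not confirmed by the source).
    """
    rng = np.random.default_rng(seed)
    a = upper(rdm_a)
    null = np.empty(n_perm)
    for i in range(n_perm):
        p = rng.permutation(rdm_b.shape[0])
        null[i] = np.corrcoef(a, upper(rdm_b[np.ix_(p, p)]))[0, 1]
    return null

def two_tailed_p(observed, null):
    """Two-tailed permutation p-value; the +1 terms keep it above zero."""
    return (1 + np.sum(np.abs(null) >= abs(observed))) / (1 + null.size)

def peak_depth_percent(layer_scores):
    """Depth of the best-aligned layer: 0% = shallowest, 100% = deepest."""
    layer_scores = np.asarray(layer_scores)
    return 100.0 * np.argmax(layer_scores) / (layer_scores.size - 1)

# Synthetic parcels from two hypothetical participants sharing latent structure.
rng = np.random.default_rng(1)
latent = rng.normal(size=(30, 4))
parcel_a = latent @ rng.normal(size=(4, 120)) + rng.normal(size=(30, 120))
parcel_b = latent @ rng.normal(size=(4, 90)) + rng.normal(size=(30, 90))
rdm_a = 1.0 - np.corrcoef(parcel_a)
rdm_b = 1.0 - np.corrcoef(parcel_b)

observed = np.corrcoef(upper(rdm_a), upper(rdm_b))[0, 1]
null = permutation_null(rdm_a, rdm_b, n_perm=200)
p_value = two_tailed_p(observed, null)

# Five hypothetical layer-wise RSA scores; the second layer is the peak.
depth = peak_depth_percent([0.10, 0.30, 0.20, 0.05, 0.00])

print(f"observed r = {observed:.2f}, null mean = {null.mean():.3f}, "
      f"null s.d. = {null.std():.3f}, p = {p_value:.4f}, peak depth = {depth:.0f}%")
```

Expressing peak depth as a percentage rather than a raw layer index is what allows models with different numbers of layers to be averaged, as described for Figure 3.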