Vision-Language Models Align with Human Neural Representations in Concept Processing

Anna Bavaresco, Marianne de Heer Kloots, Sandro Pezzelle, Raquel Fernández

TL;DR

This study systematically probes whether vision-language models (VLMs) align with human neural representations of concepts, examining architecture families and the role of contextual input. Using Representational Similarity Analysis (RSA) against fMRI data from the Pereira dataset under sentence and picture contexts, the authors compare ten off-the-shelf models (six VLMs across three families and language-only baselines) and perform two ablation analyses to separate what models learn during multimodal pretraining from the effect of the context provided at inference. The findings show that VLMs outperform their language-only counterparts in both conditions and that vision-language encoders are more brain-aligned than more recent generative VLMs. For some models, such as LXMERT and IDEFICS2, alignment stems from genuine multimodal pretraining rather than inference-time input, while others are highly sensitive to the context provided; results also differ between the left-hemisphere language network and the visual network. These results inform the design of human-aligned multimodal models and neuroAI work on multimodal grounding, with the caveat that alignment depends on architecture and context.

Abstract

Recent studies suggest that transformer-based vision-language models (VLMs) capture the multimodality of concept processing in the human brain. However, a systematic evaluation exploring different types of VLM architectures and the role played by visual and textual context is still lacking. Here, we analyse multiple VLMs employing different strategies to integrate visual and textual modalities, along with language-only counterparts. We measure the alignment between concept representations by models and existing (fMRI) brain responses to concept words presented in two experimental conditions, where either visual (pictures) or textual (sentences) context is provided. Our results reveal that VLMs outperform the language-only counterparts in both experimental conditions. However, controlled ablation studies show that only for some VLMs, such as LXMERT and IDEFICS2, brain alignment stems from genuinely learning more human-like concepts during pretraining, while others are highly sensitive to the context provided at inference. Additionally, we find that vision-language encoders are more brain-aligned than more recent, generative VLMs. Altogether, our study shows that VLMs align with human neural representations in concept processing, while highlighting differences among architectures. We open-source code and materials to reproduce our experiments at: https://github.com/dmg-illc/vl-concept-processing.

Paper Structure

This paper contains 37 sections, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Overview of the experimental setup in the sentence (top) and picture (bottom) conditions. Models are fed the same stimuli participants saw in the fMRI scanner, i.e., concept words appearing in six contexts (provided by either sentences or pictures). Note that contexts are intended to highlight the same word meaning, but may describe different situations (sentences are not image captions). Model representations and brain responses averaged across the six contexts are then used to derive representational dissimilarity matrices (RDMs), storing pairwise cosine distances. Finally, the Spearman correlation between these RDMs provides a measure of model-brain alignment. Best viewed in colour.
  • Figure 2: RSA results for the sentence condition (upper row) and picture condition (lower row). Spearman correlations indicate the alignment between concept representations by models and fMRI responses in the left-hemisphere (LH) language network and in the visual network. Numbers in brackets indicate the model layer from which representations were extracted. Note that the range of the $x$ axes differs between conditions.
  • Figure 3: Initial (as reported in the main experiment) and partial correlations between VLM representations and fMRI responses in the sentence condition. Statistically significant differences (marked by asterisks) between initial and partial correlations indicate that the brain-relevant information captured by the VLM is shared with that present in its language module.
  • Figure 4: Results from the ablation study where we pass only concept words to both VLMs and language-only models. For both brain networks, we show the Spearman correlations resulting from RSA, indicating the alignment between models and fMRI responses from the picture condition. Numbers in brackets indicate the layers from which representations are extracted.
  • Figure 5: Schematic illustrating situations that can be disambiguated by computing partial correlations. If the initial brain alignment of a VLM is attributable to information substantially shared with the language-only module, the partial correlation will be significantly weaker than the initial correlation.
  • ...and 1 more figure
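
The RSA pipeline described in the Figure 1 caption (average over the six contexts, build cosine-distance RDMs, correlate them with Spearman's rho) can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' released code: the function names `rdm_vector` and `rsa_alignment`, the array shapes, and the random inputs are all assumptions made for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def rdm_vector(features: np.ndarray) -> np.ndarray:
    """Condensed RDM from an array of shape (n_concepts, n_contexts, dim).

    Representations are averaged across contexts, then pairwise cosine
    distances between concepts are returned as the upper triangle.
    """
    averaged = features.mean(axis=1)          # (n_concepts, dim)
    return pdist(averaged, metric="cosine")   # n_concepts*(n_concepts-1)/2 values


def rsa_alignment(model_features: np.ndarray, brain_responses: np.ndarray) -> float:
    """Spearman correlation between the model RDM and the brain RDM."""
    rho, _ = spearmanr(rdm_vector(model_features), rdm_vector(brain_responses))
    return float(rho)


# Synthetic stand-ins (hypothetical sizes): 20 concepts, 6 contexts each.
rng = np.random.default_rng(0)
n_concepts, n_contexts = 20, 6
model = rng.normal(size=(n_concepts, n_contexts, 32))   # e.g. hidden states
brain = rng.normal(size=(n_concepts, n_contexts, 50))   # e.g. voxel responses

print("model-brain alignment:", rsa_alignment(model, brain))
```

Because only the upper triangle of each RDM is compared, the model and brain feature spaces may have different dimensionalities; only the number of concepts has to match.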
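
The partial-correlation logic in the Figure 3 and Figure 5 captions can be illustrated with the standard first-order partial rank correlation formula. This is a sketch under stated assumptions: the captions do not spell out the exact computation or the significance test, so the function `partial_spearman`, the variable names, and the synthetic RDM vectors below are hypothetical, and the test for a significant difference is not covered.

```python
import numpy as np
from scipy.stats import spearmanr


def partial_spearman(x: np.ndarray, y: np.ndarray, z: np.ndarray) -> float:
    """Spearman correlation between x and y after partialling out z.

    Uses r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)).
    """
    r_xy = spearmanr(x, y)[0]
    r_xz = spearmanr(x, z)[0]
    r_yz = spearmanr(y, z)[0]
    return float((r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2)))


rng = np.random.default_rng(0)
n_pairs = 190                                  # condensed RDM length for 20 concepts
language_rdm = rng.normal(size=n_pairs)        # stand-in for the language-only RDM

# Case A: VLM and brain share only what is already in the language-only RDM.
vlm_shared = language_rdm + 0.5 * rng.normal(size=n_pairs)
brain_a = language_rdm + 0.5 * rng.normal(size=n_pairs)
initial_shared = float(spearmanr(vlm_shared, brain_a)[0])
partial_shared = partial_spearman(vlm_shared, brain_a, language_rdm)

# Case B: VLM carries brain-relevant structure absent from the language-only RDM.
unique = rng.normal(size=n_pairs)
vlm_unique = unique
brain_b = unique + language_rdm + 0.5 * rng.normal(size=n_pairs)
partial_unique = partial_spearman(vlm_unique, brain_b, language_rdm)

print("case A initial vs partial:", initial_shared, partial_shared)
print("case B partial:", partial_unique)
```

In case A the partial correlation collapses relative to the initial one, which is the pattern the captions attribute to information shared with the language module; in case B it stays high.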