Vision-Language Models Align with Human Neural Representations in Concept Processing
Anna Bavaresco, Marianne de Heer Kloots, Sandro Pezzelle, Raquel Fernández
TL;DR
This study systematically tests whether vision-language models (VLMs) align with human neural representations of concepts, examining different architecture families and the role of contextual input. Using Representational Similarity Analysis (RSA) against fMRI data from the Pereira dataset, collected under sentence and picture contexts, the authors compare ten off-the-shelf models (six VLMs across three families and language-only baselines) and perform two ablation analyses to disentangle semantic from visual contributions. The findings show that VLMs align with brain responses better than their language-only counterparts, and that vision-language encoders are more brain-aligned than generative VLMs. For some models, such as LXMERT and IDEFICS2, this alignment stems from multimodal pretraining rather than from the context supplied at inference, while other models are highly sensitive to inference-time input; results also differ notably between the left-hemisphere language network and the visual network. These results inform the design of human-aligned multimodal models and advance neuroAI perspectives on multimodal grounding, while acknowledging architecture- and context-dependent limitations and pointing to future refinements.
Abstract
Recent studies suggest that transformer-based vision-language models (VLMs) capture the multimodality of concept processing in the human brain. However, a systematic evaluation exploring different types of VLM architectures and the role played by visual and textual context is still lacking. Here, we analyse multiple VLMs employing different strategies to integrate visual and textual modalities, along with their language-only counterparts. We measure the alignment between the models' concept representations and existing fMRI brain responses to concept words presented in two experimental conditions, where either visual (pictures) or textual (sentences) context is provided. Our results reveal that VLMs outperform their language-only counterparts in both experimental conditions. However, controlled ablation studies show that only for some VLMs, such as LXMERT and IDEFICS2, does brain alignment stem from genuinely learning more human-like concepts during pretraining, while others are highly sensitive to the context provided at inference. Additionally, we find that vision-language encoders are more brain-aligned than more recent, generative VLMs. Altogether, our study shows that VLMs align with human neural representations in concept processing, while highlighting differences among architectures. We open-source code and materials to reproduce our experiments at: https://github.com/dmg-illc/vl-concept-processing.
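
As an illustration of how alignment between model and brain representations can be measured, the following is a minimal sketch of Representational Similarity Analysis (RSA). It is not the authors' implementation (see the linked repository for that). It assumes correlation distance (1 - Pearson r) for the dissimilarity matrices and a Spearman rank correlation to compare them, both common choices rather than details confirmed by the text above. The function names and the synthetic arrays are hypothetical stand-ins for real model embeddings and fMRI voxel responses.

```python
import numpy as np


def rdm(features):
    """Representational dissimilarity matrix: 1 - Pearson r between rows.

    Each row is one concept; columns are model units or voxels.
    """
    return 1.0 - np.corrcoef(features)


def average_ranks(values):
    """Rank a 1-D array from 0 to n-1, giving tied values their mean rank."""
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]


def rsa_score(model_features, brain_features):
    """Spearman correlation between the upper triangles of the two RDMs."""
    n_concepts = model_features.shape[0]
    upper = np.triu_indices(n_concepts, k=1)  # skip diagonal and duplicates
    model_d = rdm(model_features)[upper]
    brain_d = rdm(brain_features)[upper]
    # Spearman correlation equals the Pearson correlation of the ranks.
    return float(np.corrcoef(average_ranks(model_d), average_ranks(brain_d))[0, 1])


# Synthetic stand-ins (assumption): 20 concepts, 64 model dims, 300 voxels.
rng = np.random.default_rng(0)
model_features = rng.normal(size=(20, 64))
brain_features = rng.normal(size=(20, 300))

score = rsa_score(model_features, brain_features)
print(f"RSA alignment (synthetic data): {score:.3f}")
```

Because RSA compares pairwise dissimilarity structure rather than raw activations, the two feature spaces may have different dimensionalities; only the set of concepts (the rows) has to match.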
