Task-Conditioned Probing Reveals Brain-Alignment Patterns in Instruction-Tuned Multimodal LLMs

Subba Reddy Oota, Khushbu Pahwa, Prachi Jindal, Satya Sai Srinath Namburi, Maneesh Singh, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta

TL;DR

The paper examines how instruction tuning shapes brain-aligned representations in instruction-tuned multimodal LLMs (IT-MLLMs) under naturalistic video-audio stimuli. Using ridge-regression encoding, it maps instruction-specific embeddings from six video IT-MLLMs and two audio IT-MLLMs to fMRI responses collected during Movie10, comparing against in-context learning (ICL), non-instruction-tuned multimodal, and unimodal baselines. The results show that IT-MLLMs, especially video-based ones, achieve stronger brain alignment and exhibit task-conditioned representations, with weak coupling to surface semantics (unlike ICL models, which strongly track the semantics of the prompt) and clear layer-to-brain hierarchical mappings. These findings advance our understanding of how task instructions shape joint information processing in the brain and IT-MLLMs, suggesting instructions can serve as controlled probes for cognitive neuroscience and brain-inspired AI design.

Abstract

Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models. More recently, instruction-tuned multimodal (IT) models have been shown to generate task-specific representations that align strongly with brain activity, yet most prior evaluations focus on unimodal stimuli or non-instruction-tuned models under multimodal stimuli. We still lack a clear understanding of whether instruction-tuning is associated with IT-MLLMs organizing their representations around functional task demands or whether these representations simply reflect surface semantics. To address this, we estimate brain alignment by predicting fMRI responses recorded during naturalistic movie watching (video with audio) from MLLM representations. Using instruction-specific embeddings from six video and two audio IT-MLLMs, across 13 video task instructions, we find that instruction-tuned video MLLMs significantly outperform in-context learning (ICL) multimodal models (~9%), non-instruction-tuned multimodal models (~15%), and unimodal baselines (~20%). Our evaluation of MLLMs across video and audio tasks, together with language-guided probing, yields distinct task-specific MLLM representations that vary across brain regions. We also find that ICL models show strong semantic organization (r=0.78), while IT models show weak coupling to instruction-text semantics (r=0.14), consistent with task-conditioned subspaces associated with higher brain alignment. These findings are consistent with an association between task-specific instructions and stronger brain-MLLM alignment, and open new avenues for mapping joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].
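The encoding setup described in the abstract (predicting fMRI responses from MLLM representations with a linear map) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fit_ridge` and `voxelwise_pearson`, the synthetic data, and the fixed regularization strength `alpha` are all assumptions made for illustration. Real pipelines typically select `alpha` by cross-validation and may normalize the resulting correlations, neither of which is shown here.

```python
import numpy as np

def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^{-1} X^T Y.

    X: (n_timepoints, n_features) stimulus embeddings (hypothetical)
    Y: (n_timepoints, n_voxels) brain responses (hypothetical)
    Returns W with shape (n_features, n_voxels), one weight column per voxel.
    """
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between true and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    # Voxels with zero variance get NaN instead of a division error.
    return (yt * yp).sum(axis=0) / np.where(denom == 0, np.nan, denom)

# Synthetic stand-in data (assumption: no real fMRI or MLLM embeddings here).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))           # embeddings per timepoint
W_true = rng.standard_normal((16, 5))        # ground-truth weights for 5 voxels
Y = X @ W_true + 0.1 * rng.standard_normal((200, 5))

# Hold out the last 50 timepoints for evaluation.
X_train, X_test = X[:150], X[150:]
Y_train, Y_test = Y[:150], Y[150:]

W = fit_ridge(X_train, Y_train, alpha=1.0)
scores = voxelwise_pearson(Y_test, X_test @ W)
print(scores.round(3))
```

Solving the regularized normal equations with `np.linalg.solve` avoids forming an explicit matrix inverse, and fitting all voxels at once works because each column of `Y` is an independent regression sharing the same `X`.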

Paper Structure

This paper contains 31 sections, 35 figures, 21 tables.

Figures (35)

  • Figure 1: Leveraging instruction-tuned multimodal video and audio models for brain encoding with a diverse set of instructions. For the given movie clip, we can obtain different multimodal representations using instructions that ask the model to (i) generate the caption of the video, (ii) identify whether temporal events are present, (iii) determine the primary colors dominant in the video, etc. Using instruction-specific representations (X), we estimate the alignment using a simple linear function $f$ (ridge regression), which maps MLLM representations to brain recordings. Here, W denotes voxelwise encoding model weights.
  • Figure 2: Average normalized brain alignment of instruction-tuned video MLLMs vs. instruction-tuned audio MLLMs vs. in-context learning video MLLMs vs. multimodal and unimodal models across whole brain, language, visual and auditory regions. Error bars indicate the standard error of the mean across participants. $*$ indicates that instruction-tuned MLLM embeddings are significantly better than multimodal models, and $\wedge$ indicates that instruction-tuned MLLM embeddings are significantly better than unimodal models, with p$\leq 0.05$. IT: Instruction-tuned, IC: In-context learning.
  • Figure 3: Each voxel is color-coded with the instruction that led to the highest normalized brain alignment. The color bar highlights color codes for each instruction. The voxels are projected onto the flattened cortical surface of the 'fsaverage' subject. (Left): video MLLM (Qwen-2.5-VL). (Right): audio MLLM (Qwen-2.5-Audio).
  • Figure 4: (a) Qwen-2.5-VL-7B Instruct, (b) Qwen-2.5-VL-3B Instruct and (c) Qwen-2.5-VL-7B Instruct non-natural language prompt (layer-wise alignment): Each voxel is color-coded with the MLLM layer number (out of 29) that led to the highest normalized brain alignment. The color bar highlights color codes for each layer. The voxels are averaged across subjects and projected onto the flattened cortical surface of 'fsaverage'.
  • Figure 5: InternVL: Normalized brain alignment before vs. after instruction tuning, computed from brain predictions across layers for the InternVL-8B-Instruct and InternVL-8B models.
  • ...and 30 more figures
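
The voxel maps in Figures 3 and 4 assign each voxel the instruction (or layer) with the highest normalized brain alignment. A minimal sketch of that winner-take-all assignment follows, assuming the alignment scores are already available as an array of shape (conditions, voxels). The function name `best_condition_per_voxel`, the instruction labels, and the numbers are hypothetical and chosen only to make the example concrete.

```python
import numpy as np

def best_condition_per_voxel(alignment):
    """Return, for each voxel, the index of the condition with the top score.

    alignment: (n_conditions, n_voxels) array of alignment scores, where a
    condition is an instruction (as in Figure 3) or a layer (as in Figure 4).
    """
    return np.argmax(alignment, axis=0)

# Hypothetical scores: 3 instructions (rows) x 3 voxels (columns).
alignment = np.array([
    [0.10, 0.40, 0.05],
    [0.30, 0.20, 0.05],
    [0.20, 0.10, 0.50],
])
labels = ["caption", "temporal_events", "dominant_colors"]  # hypothetical names

winners = best_condition_per_voxel(alignment)
winner_names = [labels[i] for i in winners]
print(winner_names)  # ['temporal_events', 'caption', 'dominant_colors']
```

Note that `np.argmax` breaks ties by returning the first maximal index, so a real analysis may want an explicit rule for voxels where several conditions score equally.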