Task-Conditioned Probing Reveals Brain-Alignment Patterns in Instruction-Tuned Multimodal LLMs
Subba Reddy Oota, Khushbu Pahwa, Prachi Jindal, Satya Sai Srinath Namburi, Maneesh Singh, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta
TL;DR
The paper examines how instruction tuning shapes brain-aligned representations in instruction-tuned multimodal LLMs (IT-MLLMs) under naturalistic video-audio stimuli. Using ridge-regression encoding, it maps instruction-specific embeddings from six video IT-MLLMs and two audio IT-MLLMs to fMRI responses collected during Movie10, and compares them against in-context learning (ICL), non-instruction-tuned multimodal, and unimodal baselines. The results show that IT-MLLMs, especially video-based ones, achieve stronger brain alignment and exhibit task-conditioned representations. These representations couple only weakly to surface semantics (unlike ICL models, which strongly track the semantics of the prompt) and show clear hierarchical layer-to-brain mappings. The findings advance our understanding of how task instructions shape joint information processing in the brain and in IT-MLLMs, and suggest that instructions can serve as controlled probes for cognitive neuroscience and brain-inspired AI design.
Abstract
Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment than unimodal models. More recently, instruction-tuned (IT) multimodal models have been shown to generate task-specific representations that align strongly with brain activity, yet most prior evaluations focus on unimodal stimuli or on non-instruction-tuned models under multimodal stimuli. We still lack a clear understanding of whether instruction tuning is associated with IT-MLLMs organizing their representations around functional task demands, or whether these representations simply reflect surface semantics. To address this, we estimate brain alignment by predicting fMRI responses recorded during naturalistic movie watching (video with audio) from MLLM representations. Using instruction-specific embeddings from six video and two audio IT-MLLMs, across 13 video task instructions, we find that instruction-tuned video MLLMs significantly outperform in-context learning (ICL) multimodal models (~9%), non-instruction-tuned multimodal models (~15%), and unimodal baselines (~20%). Across video and audio tasks, language-guided probing yields distinct task-specific MLLM representations that vary across brain regions. We also find that ICL models show strong semantic organization (r=0.78), while IT models show weak coupling to instruction-text semantics (r=0.14), consistent with task-conditioned subspaces associated with higher brain alignment. These findings are consistent with an association between task-specific instructions and stronger brain-MLLM alignment, and open new avenues for mapping joint information processing in both systems. We make the code publicly available [https://github.com/subbareddy248/mllm_videos].
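As an illustration of the voxel-wise encoding approach described above (ridge regression from model embeddings to fMRI responses), the following is a minimal sketch on synthetic data. It is not the authors' pipeline: the array shapes, the fixed regularization strength `alpha`, the single contiguous train/test split, and the use of held-out Pearson correlation as the alignment score are all assumptions made for illustration, and the helper names `fit_ridge` and `voxelwise_pearson` are hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^{-1} X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between true and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    # Guard against constant columns to avoid division by zero.
    return (yt * yp).sum(axis=0) / np.where(denom == 0, 1.0, denom)


# Synthetic stand-ins (assumed shapes): rows are fMRI time points,
# X holds instruction-specific model embeddings, Y holds voxel responses.
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 200, 50, 16, 5
X = rng.standard_normal((n_train + n_test, n_features))
W_true = rng.standard_normal((n_features, n_voxels))
Y = X @ W_true + 0.5 * rng.standard_normal((n_train + n_test, n_voxels))

# Contiguous split: fit on the first block, evaluate on the held-out block.
X_train, X_test = X[:n_train], X[n_train:]
Y_train, Y_test = Y[:n_train], Y[n_train:]

# Standardize features and center targets using training statistics only.
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
X_train_z, X_test_z = (X_train - mu) / sd, (X_test - mu) / sd
Y_mu = Y_train.mean(axis=0)

W = fit_ridge(X_train_z, Y_train - Y_mu, alpha=1.0)
Y_pred = X_test_z @ W + Y_mu

# One alignment score per voxel on held-out data.
scores = voxelwise_pearson(Y_test, Y_pred)
print(scores.round(3))
```

In practice the regularization strength would typically be chosen by cross-validation, and embeddings would need to be aligned to the fMRI sampling rate and hemodynamic delay; neither step is specified in this summary, so both are left out of the sketch.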
