The One Where They Brain-Tune for Social Cognition: Multi-Modal Brain-Tuning on Friends
Nico Policzer, Cameron Braunstein, Mariya Toneva
TL;DR
The paper shows that fine-tuning a multimodal audio-video model to predict fMRI activity in the Superior Temporal Sulcus (STS), a key region for social processing, improves alignment to that region and improves performance on a related social cognition task when the training context is similar to the evaluation data. Using fMRI data from six participants watching Friends, the authors report significant alignment gains in the STS and nearby lateral-stream ROIs, and improved sarcasm detection on MUStARD, a task set in a context related to the training data. However, the improvements do not generalize to sentiment/emotion prediction on CMU-MOSEI, suggesting that transfer is context-specific. The work provides evidence for ROI-targeted brain-tuning as a path toward more brain-aligned multimodal AI for social cognition, while highlighting the need for broader datasets and models to achieve wider generalization.
Abstract
Recent studies on audio models show that brain-tuning (fine-tuning models to better predict corresponding fMRI activity) improves brain alignment and increases performance on downstream semantic and audio tasks. We extend this approach to a multimodal audio-video model to enhance social cognition, targeting the Superior Temporal Sulcus (STS), a key region for social processing, while subjects watch Friends. We find significant increases in brain alignment to the STS and an adjacent ROI, as well as improvements on a social cognition task related to the training data: sarcasm detection in sitcoms. In summary, our study extends brain-tuning to the multimodal domain, demonstrating improvements on a downstream task after tuning to a relevant functional region.
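To make "brain alignment" concrete, the sketch below shows one common way such a score is computed: fit a linear (ridge) encoding model from model features to voxel responses, then correlate predicted and measured responses on held-out data. This is an illustrative assumption, not the authors' pipeline; the abstract does not specify the alignment metric or the brain-tuning objective. The function names, the regularization strength `alpha`, and the synthetic data are all hypothetical.

```python
import numpy as np

def ridge_fit(X, Y, alpha):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^{-1} X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def brain_alignment(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Per-voxel Pearson correlation between predicted and measured responses.

    X_*: (time points, model features); Y_*: (time points, voxels in an ROI).
    """
    # Standardize with training statistics only, to avoid test-set leakage.
    mu_x, sd_x = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
    mu_y, sd_y = Y_train.mean(axis=0), Y_train.std(axis=0) + 1e-8
    W = ridge_fit((X_train - mu_x) / sd_x, (Y_train - mu_y) / sd_y, alpha)

    pred = ((X_test - mu_x) / sd_x) @ W
    target = (Y_test - mu_y) / sd_y

    # Pearson correlation per voxel (column-wise).
    pred_c = pred - pred.mean(axis=0)
    target_c = target - target.mean(axis=0)
    denom = np.linalg.norm(pred_c, axis=0) * np.linalg.norm(target_c, axis=0) + 1e-12
    return (pred_c * target_c).sum(axis=0) / denom

# Synthetic stand-in data (hypothetical): "voxels" are noisy linear readouts of features.
rng = np.random.default_rng(0)
n_train, n_test, n_feat, n_vox = 400, 100, 16, 5
X = rng.standard_normal((n_train + n_test, n_feat))
W_true = rng.standard_normal((n_feat, n_vox))
Y = X @ W_true + 0.5 * rng.standard_normal((n_train + n_test, n_vox))

scores = brain_alignment(X[:n_train], Y[:n_train], X[n_train:], Y[n_train:])

# Control: features unrelated to the responses should align far less well.
X_rand = rng.standard_normal(X.shape)
rand_scores = brain_alignment(X_rand[:n_train], Y[:n_train], X_rand[n_train:], Y[n_train:])

print("mean alignment (related features):  ", scores.mean())
print("mean alignment (unrelated features):", rand_scores.mean())
```

Under this framing, brain-tuning would update the model's own weights so that its features predict the target ROI better, rather than only fitting the linear readout; comparing held-out alignment scores before and after tuning is then one way to quantify an increase in brain alignment.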
