The One Where They Brain-Tune for Social Cognition: Multi-Modal Brain-Tuning on Friends

Nico Policzer, Cameron Braunstein, Mariya Toneva

TL;DR

The paper demonstrates that fine-tuning a multimodal audio-video model to the social-cognition region STS can improve alignment to that region and enhance a related social cognition task when the training context is similar to the evaluation data. By using fMRI data from six participants watching Friends, the authors show significant gains in STS and nearby lateral-stream ROIs, and improved sarcasm detection on MUStARD within a related context. However, improvements do not generalize to sentiment/emotion prediction on CMU-MOSEI, suggesting context-specific transfer limitations. This work provides evidence for ROI-targeted brain tuning as a path toward more brain-aligned multimodal AI in social cognition, while highlighting the need for broader datasets and models to achieve wider generalization.

Abstract

Recent studies on audio models show brain-tuning - fine-tuning models to better predict corresponding fMRI activity - improves brain alignment and increases performance on downstream semantic and audio tasks. We extend this approach to a multimodal audio-video model to enhance social cognition, targeting the Superior Temporal Sulcus (STS), a key region for social processing, while subjects watch Friends. We find significant increases in brain alignment to the STS and an adjacent ROI, as well as improvements to a social cognition task related to the training data - sarcasm detection in sitcoms. In summary, our study extends brain-tuning to the multi-modal domain, demonstrating improvements to a downstream task after tuning to a relevant functional region.
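The brain-tuning idea described above can be illustrated with a toy sketch. It assumes that alignment is scored as the Pearson correlation between predicted and measured responses, and that a linear projection head is fit by gradient descent on mean squared error; neither detail is stated in this summary. The data are synthetic, the names `tune_head`, `predict`, and `pearson` are hypothetical, and only the head is tuned here, whereas the paper fine-tunes both the model and the projection head.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def predict(weights, features):
    """Linear projection head: one predicted response per time point."""
    return [sum(w * f for w, f in zip(weights, row)) for row in features]


def tune_head(features, target, lr=0.05, steps=200):
    """Fit the head to one voxel's time course by gradient descent on MSE."""
    n, d = len(features), len(features[0])
    weights = [0.0] * d
    for _ in range(steps):
        preds = predict(weights, features)
        grad = [0.0] * d
        for row, p, t in zip(features, preds, target):
            for j in range(d):
                grad[j] += 2 * (p - t) * row[j] / n
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Synthetic stand-ins (assumptions): 50 time points of 3-dimensional
# model features and one noisy "voxel" driven by those features.
random.seed(0)
features = [[random.gauss(0, 1) for _ in range(3)] for _ in range(50)]
true_w = [0.8, -0.5, 0.3]
voxel = [
    sum(w * f for w, f in zip(true_w, row)) + random.gauss(0, 0.1)
    for row in features
]

# Alignment with an arbitrary untuned head versus the tuned head.
before = pearson(predict([0.1, 0.1, 0.1], features), voxel)
weights = tune_head(features, voxel)
after = pearson(predict(weights, features), voxel)
print(f"alignment before: {before:.3f}, after: {after:.3f}")
```

For brevity the sketch scores alignment on the same data used for fitting; a real evaluation would measure it on held-out stimuli.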

Paper Structure

This paper contains 19 sections, 3 equations, 7 figures.

Figures (7)

  • Figure 1: Our audio-video brain-tuning approach. The same audio-video stimuli are perceived by the subject and input to the model; we fine-tune the model and projection head to better predict the corresponding brain activation.
  • Figure 2: a: Average change in alignment to lateral ROIs after brain-tuning over subjects. We find significant increases in the pSTS, aSTS, and LOC. b: Change in alignment before and after tuning on Subject-03. Differences for all subjects can be found in the appendix.
  • Figure 3: Brain-tuned and baseline performance on downstream social perception benchmarks. We find significant improvements on MUStARD A2 scores both including Friends clips ($p<0.05$) and omitting them ($p<0.01$).
  • Figure 4: One subject (Subject $5$) has no voxels in the STS above the cross-subject prediction accuracy threshold of 0.25, so we cannot perform brain-tuning for this subject.
  • Figure 5: Differences in Normalized Brain Alignment before and after brain-tuning.
  • ...and 2 more figures