Stacked Regression using Off-the-shelf, Stimulus-tuned and Fine-tuned Neural Networks for Predicting fMRI Brain Responses to Movies (Algonauts 2025 Report)

Robert Scholz, Kunal Bagga, Christine Ahrends, Carlo Alberto Barbano

TL;DR

This work tackles predicting fMRI brain responses to movie stimuli by building a multimodal encoding pipeline that integrates text, audio, video, and vision-language representations from off-the-shelf and fine-tuned models. It combines enhanced transcripts, stimulus-tuning, and selective fine-tuning with stacked regression to fuse predictions across models, achieving competitive results in the Algonauts 2025 challenge (final out-of-distribution average $r=0.1496$ across four subjects) and highlighting the value of cross-modal information and ensembling. Key findings include strong gains from slow_r50 fine-tuning ($r_{f6}=0.178$) on a per-subject basis, the effectiveness of Librarian-like transcript enhancements with InternVL3, and the superiority of ensemble approaches over single-model predictions. The work provides open-source code and resources, contributing to a scalable framework for multimodal brain encoding and informing future directions in end-to-end non-linear fusion and cross-subject transfer learning.

Abstract

We present our submission to the Algonauts 2025 Challenge, where the goal is to predict fMRI brain responses to movie stimuli. Our approach integrates multimodal representations from large language models, video encoders, audio models, and vision-language models, combining both off-the-shelf and fine-tuned variants. To improve performance, we enhanced textual inputs with detailed transcripts and summaries, and we explored stimulus-tuning and fine-tuning strategies for language and vision models. Predictions from individual models were combined using stacked regression, yielding solid results. Our submission, under the team name Seinfeld, ranked 10th. We make all code and resources publicly available, contributing to ongoing efforts in developing multimodal encoding models for brain activity.
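The stacking step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation. The synthetic data, the two hypothetical feature spaces (`X_text`, `X_video`), the ridge base learners, and the unconstrained least-squares stacking weights are all assumptions made for illustration. Each base model produces out-of-fold predictions on the training set, a per-target weight vector over models is fitted to those predictions, and the weighted combination is scored with Pearson r on held-out data.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge weights (no intercept); alpha is an assumed default."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def out_of_fold_predictions(X, Y, n_folds=5, alpha=1.0):
    """Predict every training sample with a model that did not see it."""
    n = X.shape[0]
    preds = np.zeros(Y.shape, dtype=float)
    for held_out in np.array_split(np.arange(n), n_folds):
        train_idx = np.setdiff1d(np.arange(n), held_out)
        preds[held_out] = X[held_out] @ ridge_fit(X[train_idx], Y[train_idx], alpha)
    return preds

def stacking_weights(oof_preds, Y):
    """Least-squares weights over base models, fitted separately per target."""
    P = np.stack(oof_preds, axis=-1)  # (n_samples, n_targets, n_models)
    weights = np.zeros((P.shape[1], P.shape[2]))
    for j in range(P.shape[1]):
        weights[j] = np.linalg.lstsq(P[:, j, :], Y[:, j], rcond=None)[0]
    return weights

def mean_pearson_r(Y_true, Y_pred):
    """Pearson r per target (column), averaged over targets."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    r = (yt * yp).sum(axis=0) / np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return float(r.mean())

# Synthetic stand-in for stimulus features and brain responses (assumption).
rng = np.random.default_rng(0)
n_train, n_test, n_targets = 300, 100, 5
X_text = rng.standard_normal((n_train + n_test, 10))
X_video = rng.standard_normal((n_train + n_test, 8))
Y = (X_text @ rng.standard_normal((10, n_targets))
     + X_video @ rng.standard_normal((8, n_targets))
     + 0.5 * rng.standard_normal((n_train + n_test, n_targets)))
train, test = slice(0, n_train), slice(n_train, None)

# Level 1: out-of-fold predictions per feature space; level 2: stacking weights.
features = [X_text, X_video]
oof = [out_of_fold_predictions(X[train], Y[train]) for X in features]
weights = stacking_weights(oof, Y[train])

# Refit base models on all training data and combine their test predictions.
test_preds = [X[test] @ ridge_fit(X[train], Y[train]) for X in features]
stacked = sum(weights[:, m] * test_preds[m] for m in range(len(features)))

r_text = mean_pearson_r(Y[test], test_preds[0])
r_video = mean_pearson_r(Y[test], test_preds[1])
r_stacked = mean_pearson_r(Y[test], stacked)
print(r_text, r_video, r_stacked)
```

Fitting the stacking weights on out-of-fold rather than in-sample predictions keeps the second level from rewarding base models that merely overfit the training data. Variants that constrain the weights to be non-negative or to sum to one are common, and the report may use a different formulation.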

Paper Structure

This paper contains 26 sections, 4 figures, 8 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of the brain encoding approach and the competition structure. (A) General fMRI brain activity encoding pipeline. (B) The two main stages of the Algonauts 2025 Challenge, along with the respective training and test datasets.
  • Figure 2: Overview of prediction sources and stacking approach used in our final submission.
  • Figure S1: Enter Caption
  • Figure S2: (A) Training loss by epoch when fine-tuning on Friends seasons 1 to 5. (B) Pearson r when fine-tuning slow_r50 per subject versus fine-tuning across all subjects.