Multimodal Recurrent Ensembles for Predicting Brain Responses to Naturalistic Movies (Algonauts 2025)

Semih Eren, Deniz Kucukahmetler, Nico Scherf

TL;DR

This work predicts time-resolved brain responses to naturalistic movies by integrating visual, auditory, and linguistic information in a hierarchical multimodal recurrent ensemble. Features are extracted per modality from large pretrained models, encoded by modality-specific bidirectional RNNs, fused and passed to a post-fusion RNN, and mapped to parcel responses by subject-specific output heads. Training uses a curriculum that gradually shifts emphasis from early sensory to late association regions, and the predictions of 100 models are averaged for robustness. The approach achieves an overall Pearson correlation of $r=0.2094$ and a peak single-parcel score of $r=0.63$, performs robustly across subjects, and offers a practical baseline for future multimodal brain-encoding benchmarks. The results highlight the value of curriculum-based training and diverse ensembles for aligning temporally extended brain activity with rich multimodal stimuli, while also pointing to limitations in prefrontal regions and with transformer backbones for this task.
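To make the reported numbers concrete, the following is a minimal sketch of how predictions from several models could be averaged and then scored with a parcel-wise Pearson correlation. It assumes the overall score is the mean of per-parcel correlations computed across time; the function names (`pearson_r`, `ensemble_mean`, `mean_parcel_correlation`) and the nested-list data layout are illustrative choices, not taken from the paper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # convention assumed here for constant series
    return cov / math.sqrt(vx * vy)


def ensemble_mean(predictions):
    """Average predictions over models.

    predictions: list with one entry per model, each a [time][parcel]
    nested list of predicted responses (hypothetical layout).
    """
    n_models = len(predictions)
    n_time = len(predictions[0])
    n_parcels = len(predictions[0][0])
    return [
        [sum(p[t][k] for p in predictions) / n_models for k in range(n_parcels)]
        for t in range(n_time)
    ]


def mean_parcel_correlation(pred, target):
    """Return (mean r over parcels, list of per-parcel r values)."""
    n_parcels = len(target[0])
    rs = [
        pearson_r([row[k] for row in pred], [row[k] for row in target])
        for k in range(n_parcels)
    ]
    return sum(rs) / n_parcels, rs


# Toy example: two models whose errors cancel when averaged.
target = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
noise = [[0.5, -0.5], [-0.5, 0.5], [0.5, 0.5], [-0.5, -0.5]]
model_a = [[t + e for t, e in zip(tr, er)] for tr, er in zip(target, noise)]
model_b = [[t - e for t, e in zip(tr, er)] for tr, er in zip(target, noise)]

averaged = ensemble_mean([model_a, model_b])
score, per_parcel = mean_parcel_correlation(averaged, target)
```

In this toy case the two models err in opposite directions, so the averaged prediction scores higher than either model alone. Real ensemble members have only partially independent errors, so the gain from averaging is smaller.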

Abstract

Accurately predicting distributed cortical responses to naturalistic stimuli requires models that integrate visual, auditory and semantic information over time. We present a hierarchical multimodal recurrent ensemble that maps pretrained video, audio, and language embeddings to fMRI time series recorded while four subjects watched almost 80 hours of movies provided by the Algonauts 2025 challenge. Modality-specific bidirectional RNNs encode temporal dynamics; their hidden states are fused and passed to a second recurrent layer, and lightweight subject-specific heads output responses for 1000 cortical parcels. Training relies on a composite MSE-correlation loss and a curriculum that gradually shifts emphasis from early sensory to late association regions. Averaging 100 model variants further boosts robustness. The resulting system ranked third on the competition leaderboard, achieving an overall Pearson r = 0.2094 and the highest single-parcel peak score (mean r = 0.63) among all participants, with particularly strong gains for the most challenging subject (Subject 5). The approach establishes a simple, extensible baseline for future multimodal brain-encoding benchmarks.
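The composite loss and curriculum described above can be sketched as follows. The exact formulation is not given in this section, so everything here is an assumption: the loss is taken as `alpha * MSE + (1 - alpha) * (1 - r)` per parcel, and the curriculum is modeled as per-parcel weights that move linearly from favoring sensory parcels to favoring association parcels as training progresses. The names `curriculum_weights`, `composite_loss`, `alpha`, and `floor` are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 for constant input (assumed convention)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def curriculum_weights(progress, is_sensory, floor=0.1):
    """Per-parcel loss weights for training progress in [0, 1].

    At progress 0 sensory parcels get weight 1 and association parcels
    get `floor`; at progress 1 the roles are swapped (linear schedule,
    assumed for illustration).
    """
    progress = min(max(progress, 0.0), 1.0)
    w_sensory = 1.0 - (1.0 - floor) * progress
    w_assoc = floor + (1.0 - floor) * progress
    return [w_sensory if s else w_assoc for s in is_sensory]


def composite_loss(pred, target, weights, alpha=0.5):
    """Weighted mean over parcels of alpha * MSE + (1 - alpha) * (1 - r).

    pred, target: [time][parcel] nested lists; weights: one value per parcel.
    """
    n_time = len(target)
    total = 0.0
    for k, w in enumerate(weights):
        p = [row[k] for row in pred]
        t = [row[k] for row in target]
        mse = sum((a - b) ** 2 for a, b in zip(p, t)) / n_time
        corr_term = 1.0 - pearson_r(p, t)
        total += w * (alpha * mse + (1.0 - alpha) * corr_term)
    return total / sum(weights)


# Toy example: parcel 0 is "sensory" and poorly predicted,
# parcel 1 is "association" and perfectly predicted.
target = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
pred = [[4.0, 2.0], [1.0, 1.0], [2.0, 4.0], [3.0, 3.0]]
is_sensory = [True, False]

early = composite_loss(pred, target, curriculum_weights(0.0, is_sensory))
late = composite_loss(pred, target, curriculum_weights(1.0, is_sensory))
```

Because the only error in this toy case sits in the sensory parcel, the loss is larger early in the schedule (`early`) than late (`late`), which is the intended effect of shifting emphasis between regions.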

Paper Structure

This paper contains 16 sections, 6 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: a. Method Illustration. b. Final Model Ensemble Components.
  • Figure 2: (a) Glass‑brain rendering of the Schaefer‑2018 1000‑parcel atlas, highlighting seven functional networks: Visual (Vis), Somatomotor (SomMot), Dorsal Attention (DorsAttn), Salience/Ventral Attention (SalVentAttn), Limbic, Frontoparietal (Cont), and Default. Smaller networks are rendered last to ensure full visibility, and the legend maps each color to its corresponding network. (b) Ablation study comparing three variants against our proposed Modality‑Separated Head model: one excluding the multi‑head LSTM ("Single Head"), one excluding the multi‑modal LSTM ("Unified‑Modality"), and one excluding both ("Unified‑Modality and Single Head"). (c) Performance metrics for the final models and their ensemble.