Improving Multimodal Brain Encoding Model with Dynamic Subject-awareness Routing

Xuanhua Yin, Runkai Zhao, Weidong Cai

TL;DR

This paper tackles multimodal brain encoding under substantial inter-subject variability by introducing AFIRE, a fusion-agnostic interface that standardizes time-aligned post-fusion tokens, and MIND, a sparse, subject-aware Mixture-of-Experts decoder powered by SADGate. The approach decouples fusion from decoding and enables plug-and-play integration with diverse backbones, achieving end-to-end whole-brain predictions. Empirical results on the Algonauts 2025 benchmark show consistent improvements across backbones (TRIBE, ImageBind, Qwen2.5-Omni) and subjects, with stronger cross-subject generalization and interpretable expert routing patterns. These findings suggest that fusion-agnostic interfaces combined with subject-aware MoE decoders offer robust, personalized, and scalable improvements for naturalistic neuroimaging encoding tasks.

Abstract

Naturalistic fMRI encoding must handle multimodal inputs, shifting fusion styles, and pronounced inter-subject variability. We introduce AFIRE (Agnostic Framework for Multimodal fMRI Response Encoding), an agnostic interface that standardizes time-aligned post-fusion tokens from varied encoders, and MIND, a plug-and-play Mixture-of-Experts decoder with subject-aware dynamic gating. Trained end-to-end for whole-brain prediction, AFIRE decouples the decoder from upstream fusion, while MIND combines token-dependent Top-K sparse routing with a subject prior to personalize expert usage without sacrificing generality. Experiments across multiple multimodal backbones and subjects show consistent improvements over strong baselines, enhanced cross-subject generalization, and interpretable expert patterns that correlate with content type. The framework offers a simple attachment point for new encoders and datasets, enabling robust, plug-and-improve performance for naturalistic neuroimaging studies.
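
The abstract describes MIND's gate as token-dependent Top-K sparse routing combined with a subject prior. The following NumPy sketch illustrates one way such a gate could work. It is not the authors' implementation. The symbols `alpha` (global expert logits) and `B` (subject-expert bias matrix) follow the Figure 2(c) caption, but the softmax form of the prior, the additive combination in log space, and the function names `subject_prior` and `sparse_route` are assumptions made for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def subject_prior(alpha, B, s):
    """Prior over experts for subject s.

    Assumed form: softmax(alpha + B[s]), with alpha of shape (E,)
    and B of shape (num_subjects, E).
    """
    return softmax(alpha + B[s])


def sparse_route(token_logits, alpha, B, s, k):
    """Return (T, E) expert weights with exactly k nonzeros per time step.

    token_logits: (T, E) token-dependent gate scores.
    Assumption: the subject prior is added in log space before Top-K.
    """
    logits = token_logits + np.log(subject_prior(alpha, B, s))
    # Indices of the k largest logits in each row (order within the k is arbitrary).
    topk = np.argpartition(-logits, k - 1, axis=-1)[:, :k]
    mask = np.zeros(logits.shape, dtype=bool)
    np.put_along_axis(mask, topk, True, axis=-1)
    # Renormalize over the selected experts only; the rest get weight 0.
    return softmax(np.where(mask, logits, -np.inf), axis=-1)


# Toy usage: 5 time steps, 4 experts, 3 subjects, Top-2 routing.
rng = np.random.default_rng(0)
token_logits = rng.normal(size=(5, 4))
alpha = rng.normal(size=4)
B = rng.normal(size=(3, 4))
weights_s0 = sparse_route(token_logits, alpha, B, s=0, k=2)
weights_s1 = sparse_route(token_logits, alpha, B, s=1, k=2)
```

Under these assumptions, two subjects who see the same tokens receive different sparse weights only through `B[s]`, which is one simple way a persistent subject preference could shift expert usage while the token-dependent scores stay shared.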

Paper Structure

This paper contains 13 sections, 5 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: AFIRE pipeline for fMRI prediction. The brain integrates multimodal information concurrently over time. AFIRE mirrors this process by exposing time-aligned, fusion-agnostic tokens to a subject-aware dynamic decoder (MIND), enabling whole-brain fMRI prediction without backbone-specific tailoring.
  • Figure 2: Overall pipeline. (a) Agnostic Framework for Multimodal fMRI Response Encoding (AFIRE): a thin, fusion-agnostic post-fusion interface that standardizes time-aligned tokens from arbitrary encoders and exposes a shared fused space to the decoder. It addresses the heterogeneity and cross-backbone inconsistency of multimodal fusion by decoupling fusion specifics from decoding, which enables plug-and-play, subject-aware whole-brain prediction. (b) Mixture-of-Experts Integrated Decoder (MIND): a decoder that operates directly on the shared fused space, dynamically partitioning and reweighting latent subspaces; because upstream fusion details stay separate from decoding, it supports plug-and-play, subject-aware whole-brain prediction. (c) Subject Prior Router: to model persistent subject preferences and stabilize routing, a global expert-logit vector $\alpha$ and a subject–expert bias matrix $B$ define a prior $\pi(s)$. Top-$K$ selection and normalization produce sparse expert weights $\hat{w}_t$ that adapt across subjects.
  • Figure 3: Parcel-wise prediction–measurement correlation on Friends S6E5. TRIBE, ImageBind, and Qwen2.5-Omni (rows; all with MIND) show similar spatial patterns of Pearson r between predicted and measured fMRI, indicating fusion-agnostic robustness.
  • Figure 4: Subject routing dynamics (first 100 TRs). Same-episode expert weights for $S_{1}$, $S_{2}$, $S_{3}$, $S_{5}$ (colors denote experts). Weight curves over time indicate subject-specific preferences modulated by token-dependent signals, showing that MIND captures inter-subject variability.
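
The parcel-wise Pearson r between predicted and measured fMRI, shown in Figure 3, can be computed along the following lines. This is a minimal sketch and not the authors' evaluation code. It assumes that predictions and measurements are stored as arrays of shape (time points, parcels), and the function name `parcelwise_pearson` is hypothetical.

```python
import numpy as np


def parcelwise_pearson(pred, meas):
    """Pearson r per parcel between predicted and measured time series.

    pred, meas: arrays of shape (T, P), T time points by P parcels
    (an assumed layout). Returns an array of shape (P,). Parcels with
    zero variance in either input yield NaN.
    """
    pred_c = pred - pred.mean(axis=0)
    meas_c = meas - meas.mean(axis=0)
    num = (pred_c * meas_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (meas_c ** 2).sum(axis=0))
    return num / np.where(den == 0, np.nan, den)


# Toy usage with synthetic data: 50 time points, 3 parcels.
rng = np.random.default_rng(0)
meas = rng.normal(size=(50, 3))
pred = meas.copy()
pred[:, 1] = -meas[:, 1]              # perfectly anti-correlated parcel
pred[:, 2] = rng.normal(size=50)      # unrelated prediction
r = parcelwise_pearson(pred, meas)
```

Averaging `r` over parcels gives a single summary score per subject, while keeping the per-parcel vector allows spatial maps like those in Figure 3.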