BrainVista: Modeling Naturalistic Brain Dynamics as Multimodal Next-Token Prediction

Xuanhua Yin, Runkai Zhao, Lina Yao, Weidong Cai

TL;DR

BrainVista treats naturalistic fMRI as time-aligned, autoregressive forecasting conditioned on past brain states and multimodal stimuli. It introduces Network-wise Tokenizers to respect cortical network structure, a Spatial Mixer Head to regulate cross-network information flow, and Stimulus-to-Brain masking to enforce strict past-only conditioning, enabling stable long-horizon rollout. The method achieves state-of-the-art encoding on Algonauts 2025, CineBrain, and HAD, with reduced drift and improved pattern fidelity at horizons up to $H=20$. By directly addressing timescale mismatch and functional heterogeneity, BrainVista offers more faithful, interpretable simulations of brain dynamics and better cross-subject generalization.
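The recursive rollout described above can be illustrated with a minimal sketch. Everything here is hypothetical and is not the authors' code: `rollout`, `toy_step`, and the list-of-floats state format are illustrative names, and the rule that a forecast for time t may see stimuli only up to t-1 is an assumed reading of "past-only conditioning".

```python
from typing import Callable, List, Sequence

State = List[float]
StepFn = Callable[[Sequence[State], Sequence[State]], State]


def rollout(step_fn: StepFn, brain_history: Sequence[State],
            stimuli: Sequence[State], horizon: int) -> List[State]:
    """Forecast `horizon` future brain states, feeding predictions back in.

    Assumption: stimuli[k] is aligned with time step k, and the forecast
    for time t may only see stimuli[0..t-1] (strictly past frames).
    """
    context = [list(state) for state in brain_history]
    start = len(context)
    if start == 0:
        raise ValueError("at least one observed brain state is required")
    if len(stimuli) < start + horizon - 1:
        raise ValueError("not enough stimulus frames for the requested horizon")
    predictions: List[State] = []
    for h in range(horizon):
        t = start + h  # index of the brain state being predicted
        next_state = step_fn(context, stimuli[:t])
        predictions.append(next_state)
        context.append(next_state)  # predicted state becomes history
    return predictions


def toy_step(brain_context: Sequence[State],
             stim_context: Sequence[State]) -> State:
    """Hypothetical stand-in for a trained model: average of last inputs."""
    last_brain, last_stim = brain_context[-1], stim_context[-1]
    return [0.5 * b + 0.5 * s for b, s in zip(last_brain, last_stim)]


if __name__ == "__main__":
    print(rollout(toy_step, [[1.0]], [[0.0], [2.0], [4.0]], horizon=3))
```

Because each predicted state is appended to the context, errors can compound with the horizon, which is why drift at long horizons is a natural quantity to report.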

Abstract

Naturalistic fMRI characterizes the brain as a dynamic predictive engine driven by continuous sensory streams. However, modeling the causal forward evolution in realistic neural simulation is impeded by the timescale mismatch between multimodal inputs and the complex topology of cortical networks. To address these challenges, we introduce BrainVista, a multimodal autoregressive framework designed to model the causal evolution of brain states. BrainVista incorporates Network-wise Tokenizers to disentangle system-specific dynamics and a Spatial Mixer Head that captures inter-network information flow without compromising functional boundaries. Furthermore, we propose a novel Stimulus-to-Brain (S2B) masking mechanism to synchronize high-frequency sensory stimuli with hemodynamically filtered signals, enabling strict, history-only causal conditioning. We validate our framework on Algonauts 2025, CineBrain, and HAD, achieving state-of-the-art fMRI encoding performance. In long-horizon rollout settings, our model yields substantial improvements over baselines, increasing pattern correlation by 36.0% and 33.3% on Algonauts 2025 and CineBrain, respectively, relative to the strongest baseline.
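To make "history-only causal conditioning" concrete, the sketch below contrasts a standard causal mask with a stricter variant on an interleaved stimulus/brain token sequence. The exact S2B rule is not given in this text, so `strict_past_mask` is only one hypothetical reading (a token may attend to itself and to tokens from strictly earlier time steps) and should not be taken as the paper's definition. The function names and the `(kind, time)` token format are likewise illustrative.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (kind, time step); kind is "stim" or "brain"


def interleave(num_steps: int) -> List[Token]:
    """Build the interleaved order: stim_0, brain_0, stim_1, brain_1, ..."""
    return [(kind, t) for t in range(num_steps) for kind in ("stim", "brain")]


def causal_mask(tokens: List[Token]) -> List[List[bool]]:
    """Standard causal mask: query i may attend to any key j <= i."""
    n = len(tokens)
    return [[j <= i for j in range(n)] for i in range(n)]


def strict_past_mask(tokens: List[Token]) -> List[List[bool]]:
    """Assumed stricter variant: a query attends to itself and to keys
    whose time step is strictly earlier than its own."""
    n = len(tokens)
    return [
        [i == j or tokens[j][1] < tokens[i][1] for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    toks = interleave(2)
    for name, mask in (("causal", causal_mask(toks)),
                       ("strict", strict_past_mask(toks))):
        print(name)
        for row in mask:
            print("".join("x" if allowed else "." for allowed in row))
```

Under the standard mask the brain token at step t can see the stimulus token of the same step; under the stricter variant it cannot, so every prediction depends only on earlier time steps.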

Paper Structure (18 sections, 14 equations, 7 figures, 10 tables)


Figures (7)

  • Figure 1: Naturalistic fMRI is framed as time-aligned prediction of future brain activity from recent brain history together with sensory context, modeling sequential brain activity by conditioning on past brain tokens and aligned stimuli to recursively simulate future trajectories.
  • Figure 2: Framework of BrainVista. Time-aligned multimodal features are aggregated into stimuli tokens $\tau_s^t$. Parcel-wise fMRI is encoded and decoded by network-specific tokenizers pretrained via self-reconstruction into fMRI circuit tokens $\tau_f^t$, to preserve functional specificity and support cross-subject alignment. An interleaved, time-aligned token sequence is fed into BrainVista to forecast future fMRI circuit tokens and decode them back to fMRI, pairing stimulus context with the corresponding brain state at each time step. Details of the Stimulus-to-Brain masking and the Spatial Mixer Head are illustrated in Fig. 3.
  • Figure 3: BrainVista core: a temporal causal Transformer with the Stimulus-to-Brain causal masking, followed by a Spatial Mixer Head (network-wise attention) to model cross-network interactions before producing next-step fMRI circuit-token predictions.
  • Figure 4: Model-based analyses on Algonauts 2025 aggregated to Yeo-7. Shown for sub01 and sub03. (a) Whole-brain rollout fidelity ($p_{\mathrm{corr}}$, $H{=}10$) mapped to cortical surfaces (dorsal/ventral/lateral). (b) Network-wise fidelity by aggregating ROIs into Yeo-7. (c) We set one history token to zero at a time and measure the increase in $L_2$ error, where larger values indicate stronger reliance on that token. (d) Spatial Mixer attention reveals dominant cross-network pathways. We aggregate attention to Yeo-7 network pairs and average across time. Only connections with mean attention $\ge 0.2$ are shown. Chord thickness scales with coupling strength.
  • Figure 5: Ablations on Algonauts 2025. (a) Tokenization granularity: We compare three input representations: continuous fMRI without tokenization, a shared full-brain tokenizer, and our Network-wise Tokenizers. (b) Temporal-causality masking: We compare bidirectional, standard causal, and S2B-causal attention, evaluated by $p_{\mathrm{corr}}$ at rollout horizons $H\in\{1,10,20\}$.
  • ...and 2 more figures
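
The token-ablation analysis in the Figure 4(c) caption (zero one history token at a time and measure the increase in $L_2$ error) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `token_importance`, `l2_error`, and `toy_predict` are hypothetical names, and the weighted-sum predictor stands in for a trained model.

```python
import math
from typing import Callable, List, Sequence

Token = List[float]
PredictFn = Callable[[Sequence[Token]], Token]


def l2_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Euclidean distance between a prediction and its target."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))


def token_importance(predict_fn: PredictFn, history: Sequence[Token],
                     target: Sequence[float]) -> List[float]:
    """Increase in L2 error when each history token is zeroed in turn.

    Larger values mean the prediction relies more on that token.
    """
    baseline = l2_error(predict_fn(history), target)
    scores: List[float] = []
    for k in range(len(history)):
        ablated = [list(tok) for tok in history]
        ablated[k] = [0.0] * len(history[k])  # zero out one token only
        scores.append(l2_error(predict_fn(ablated), target) - baseline)
    return scores


def toy_predict(history: Sequence[Token]) -> Token:
    """Hypothetical model: weights the most recent token more heavily."""
    weights = [0.1, 0.9]
    dim = len(history[0])
    return [sum(w * tok[d] for w, tok in zip(weights, history))
            for d in range(dim)]


if __name__ == "__main__":
    print(token_importance(toy_predict, [[1.0], [1.0]], [1.0]))
```

In this toy setup the second (most recent) token receives the larger score, matching the caption's reading that a bigger error increase indicates stronger reliance on the ablated token.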