BrainSymphony: A parameter-efficient multimodal foundation model for brain dynamics with limited data

Moein Khajehnejad, Forough Habibollahi, Devon Stoliker, Adeel Razi

TL;DR

BrainSymphony introduces a parameter-efficient multimodal foundation model that unifies fMRI dynamics with diffusion-derived structural connectivity through a modular architecture. By coupling a Spatio–Temporal fMRI encoder (Spatial and Temporal Transformers plus a 1D context path) with a Signed Graph Transformer for SC and an adaptive fusion gate, it delivers state-of-the-art performance with only 5.6M parameters, substantially reducing data and compute requirements. The model achieves high-fidelity reconstructions, recovers canonical networks in an unsupervised manner, and generalizes to an external psychedelic dataset, where attention maps provide mechanistic interpretations of drug- and state-dependent brain reorganization. The work highlights how architecture-informed multimodal models can surpass larger baselines while offering interpretability and potential clinical applicability, paving the way for accessible AI in neuroscience.

Abstract

Foundation models are transforming neuroscience but are often prohibitively large, data-hungry, and difficult to deploy. Here, we introduce BrainSymphony, a lightweight and parameter-efficient foundation model with plug-and-play integration of fMRI time series and diffusion-derived structural connectivity, allowing unimodal or multimodal training and deployment without architectural changes while requiring substantially less data compared to the state-of-the-art. The model processes fMRI time series through parallel spatial and temporal transformer streams, distilled into compact embeddings by a Perceiver module, while a novel signed graph transformer encodes anatomical connectivity from diffusion MRI. These complementary representations are then combined through an adaptive fusion mechanism. Despite its compact design, BrainSymphony consistently outperforms larger models on benchmarks spanning prediction, classification, and unsupervised network discovery. Highlighting the model's generalizability and interpretability, attention maps reveal drug-induced context-dependent reorganization of cortical hierarchies in an independent psilocybin neuroimaging dataset. BrainSymphony delivers accessible, interpretable, and clinically meaningful results and demonstrates that architecturally informed, multimodal models can surpass much larger counterparts and advance applications of AI in neuroscience.
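
The adaptive fusion step mentioned above can be illustrated with a minimal sketch. The abstract does not specify the gate's exact form, so the elementwise sigmoid gate below, the function name `gated_fusion`, the parameters `W` and `b`, and the embedding size `d` are all assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(h_func, h_struct, W, b):
    """Fuse a functional and a structural embedding with a learned gate.

    h_func, h_struct: arrays of shape (d,), hypothetical fMRI and SC embeddings.
    W: array of shape (d, 2d), b: array of shape (d,), hypothetical gate parameters.
    Returns the fused embedding and the gate values.
    """
    # The gate looks at both embeddings and outputs one weight per dimension.
    g = sigmoid(W @ np.concatenate([h_func, h_struct]) + b)
    # Convex combination: g near 1 favors the functional embedding,
    # g near 0 favors the structural one.
    fused = g * h_func + (1.0 - g) * h_struct
    return fused, g


rng = np.random.default_rng(0)
d = 8  # illustrative embedding size
h_func = rng.normal(size=d)
h_struct = rng.normal(size=d)
W = rng.normal(scale=0.1, size=(d, 2 * d))
b = np.zeros(d)

fused, gate = gated_fusion(h_func, h_struct, W, b)
print(fused.shape, gate.min(), gate.max())
```

Because the gate is a convex combination per dimension, the fused vector always lies between the two inputs elementwise. A gate of this kind can also be bypassed when one modality is unavailable, which is one plausible way to support the unimodal use described in the abstract.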

Paper Structure

This paper contains 37 sections, 45 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Architecture of BrainSymphony. a) Three parallel streams encode the fMRI time series: a Spatial Transformer captures relationships between brain regions, a Temporal Transformer models dynamics over time, and a 1D-CNN extracts local features from the signal context. b) The outputs from the three fMRI encoder streams are fed into a Perceiver module. This module uses cross-attention against a set of learned latents to distill the rich fMRI information into a compact, fixed-size representation. c) In parallel, a Signed Graph Transformer encodes the structural connectome. A learned gate adaptively fuses functional and structural embeddings for downstream prediction.
  • Figure 2: Performance of BrainSymphony across modalities, tasks, and comparison with baselines. a–b) Gender classification: fine-tuned fusion achieves the highest accuracy (94.0%) and F1 score ($0.933$). c–d) Age prediction: fusion yields the lowest MSE ($0.363$) and highest correlation ($\rho=0.841$). e) Radar chart of peak performance (MSE inverted for visualization), showing consistent superiority of the fine-tuned multimodal model across tasks. f–g) Gender classification performance of competing models (ACC, F1). h–i) Age prediction performance of competing models (MSE, $\rho$). j) Overall performance summary: BrainSymphony (fusion) attains the best scores across all metrics while using only $5.6$M parameters, substantially fewer than BrainLM ($111$M) and Brain-JEPA ($85$M). k) A detailed breakdown of BrainSymphony's performance across different modalities and evaluation protocols. l) Comparison of our proposed BrainSymphony model against baselines, showing key performance metrics alongside model parameter co
  • Figure 3: Functional network identification and multimodal reconstruction of brain activity and connectivity. a) Reconstructed spatial activation patterns for a sample subject at representative time points from the temporal transformer. b) Reconstructed BOLD time courses for sample ROIs, closely tracking the original signals from the spatial transformer. c) Original versus reconstructed structural connectivity (with connection weights plotted on a logarithmic scale) for two sample subjects, preserving key modular topology. d) Distribution of BOLD time series reconstruction accuracy ($R^2$) from the Perceiver block, showing strong performance across ROIs. e) Multi-faceted evaluation of structural connectivity reconstruction, showing high per-subject pattern correlation ($\rho = 0.818 \pm 0.028$, left), accurate prediction of individual edge weights (middle), and an empirically matched distribution of reconstructed weights (right). f) Ground truth assignment of 400 ROIs to seven canonical networks. g) Network assignments as predicted by the model's classifier. h) Spatial map highlighting misclassified ROIs (black $\times$). i) Row-normalized confusion matrix showing high per-network accuracy. j) Overall classification accuracy, with BrainSymphony (84.4%) outperforming comparison models and chance level (14.3%).
  • Figure 4: Reconstruction fidelity and attention mapping of BrainSymphony on the external PsiConnect dataset. a) Reconstruction performance summarized across conditions (Rest, Meditation, Music, Movie) and contexts (Admin vs. Baseline). Paired dot plots compare real ROI time series (purple, teal, grey) with randomly permuted controls, showing that BrainSymphony consistently achieves higher $R^2$ and $\rho$ and lower MAE. b) Circos plots of attention weights (Admin–Baseline difference) for the four conditions, showing the top 500 strongest edges, averaged over all subjects. Each edge is colored according to the source network, such that all outgoing connections from a given network share the same color, indicating the origin of the attentional query (and thus which network is most strongly integrated with the system). An inner bar track, concentric with the network arcs, depicts the total outgoing attention, with the radial height proportional to this value for each network. c) Average receptive (incoming) attention to each ROI (Admin–Baseline difference), computed as the mean of all attention weights directed to the ROI, normalized by the total number of ROIs and averaged across subjects. Bars are colored by the ROI’s network affiliation, illustrating preferential drivers of network-wide influence (regions receiving the strongest incoming attention). d) Network-level averages of the incoming attention weights in subplot (c), obtained by pooling across ROIs within each canonical network.
  • Figure 5: Psilocybin-related changes in receptive (incoming) attention across MEQ subgroups and conditions. Top row: reference flatmaps showing the seven canonical functional networks. Remaining rows: difference maps (Psilocybin – Baseline) in receptive attention for subjects in the top (High MEQ, left) and bottom (Low MEQ, right) deciles of MEQ30 scores. Each row corresponds to one condition from the PsiConnect dataset (rest, meditation, music, movie). Heatmaps (center) display block-averaged changes between networks, derived from the same subjects, providing a compact summary of psilocybin-related differences in inter-network receptive attention.
  • ...and 8 more figures
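
The receptive (incoming) attention summary described in the Figure 4c–d caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code: the array layout `attn[s, i, j]` (weight from query ROI `i` to key ROI `j` for subject `s`), the helper names, the toy sizes, and the random softmax attention used as stand-in data are all assumptions.

```python
import numpy as np


def receptive_attention(attn):
    """Mean incoming attention per ROI.

    attn: array of shape (n_subjects, n_rois, n_rois); attn[s, i, j] is assumed
    to be the attention weight from query ROI i to key ROI j for subject s.
    Returns an array of shape (n_rois,).
    """
    # Sum of weights directed to ROI j divided by the number of ROIs,
    # i.e. the mean over source ROIs i.
    incoming = attn.mean(axis=1)  # (n_subjects, n_rois)
    # Average across subjects.
    return incoming.mean(axis=0)


def network_average(roi_values, labels, n_networks):
    """Pool per-ROI values within each network (labels are 0..n_networks-1)."""
    return np.array([roi_values[labels == k].mean() for k in range(n_networks)])


rng = np.random.default_rng(0)
n_subjects, n_rois, n_networks = 5, 12, 3   # toy sizes, not the paper's
labels = np.arange(n_rois) % n_networks     # hypothetical ROI-to-network map


def random_attention():
    """Stand-in attention maps: row-wise softmax over random logits."""
    logits = rng.normal(size=(n_subjects, n_rois, n_rois))
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


attn_admin = random_attention()
attn_base = random_attention()

# Admin minus Baseline difference, per ROI and then per network.
roi_diff = receptive_attention(attn_admin) - receptive_attention(attn_base)
net_diff = network_average(roi_diff, labels, n_networks)
print(roi_diff.shape, net_diff.shape)
```

With row-normalized attention, each condition's per-ROI incoming attention sums to 1, so the difference map sums to zero: an increase in attention received by some regions is necessarily balanced by a decrease elsewhere.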