fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding

Yuxiang Wei, Yanteng Zhang, Xi Xiao, Chengxuan Qian, Tianyang Wang, Vince D. Calhoun

TL;DR

Problem: bridging fMRI with language to build universal brain representations. Approach: a three-stage framework (fMRI tokenizer, LLM alignment, multi-task instruction tuning) trained on UKB/ABCD resting-state data and a synthetic fMRI–text descriptor corpus grounding low-level brain organization in language. Contributions: (i) a large descriptive corpus translating imaging features into textual descriptors; (ii) a text-aligned fMRI tokenizer mapping fMRI to discrete tokens in a language space; (iii) LLM fine-tuning with temporal modeling and multi-task instruction tuning; (iv) strong zero-shot and few-shot generalization with LoRA-enabled parameter efficiency. Impact: enables scalable, language-grounded brain modeling across datasets and tasks.

Abstract

Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.
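As a rough illustration of what "maps fMRI into discrete tokens" can mean in Stage 1, the sketch below shows a generic nearest-neighbour vector-quantization lookup. It is not the authors' tokenizer: the function name `quantize`, the toy 2-D codebook, and the feature values are all assumptions made for this example, and the learned encoder and the training losses are omitted entirely.

```python
import numpy as np

def quantize(features, codebook):
    """Assign each feature vector to its nearest codebook entry.

    features: (T, D) array, e.g. one D-dim encoder output per fMRI segment
              (hypothetical shapes, chosen for illustration).
    codebook: (K, D) array of K code vectors (learned in practice, fixed here).
    Returns (token_ids, quantized) with shapes (T,) and (T, D).
    """
    # Squared Euclidean distance between every feature and every code: (T, K)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    token_ids = dists.argmin(axis=1)   # discrete token index per feature
    quantized = codebook[token_ids]    # the code vector that replaces each feature
    return token_ids, quantized

# Toy codebook with four codes in 2-D (illustrative values only).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
features = np.array([[0.9, 0.1], [0.1, 0.8], [0.95, 1.05]])

token_ids, quantized = quantize(features, codebook)
print(token_ids.tolist())
```

The resulting integer indices are what a language model can consume as a token sequence; in a text-aligned setting the code vectors would additionally be trained to lie in or near the LLM's embedding space.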

Paper Structure

This paper contains 12 sections, 5 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: The proposed fMRI-LM outperforms baselines on diverse tasks. fMRI-LM demonstrates comprehensive and powerful performance.
  • Figure 2: Overview of the fMRI tokenizer, which consists of a Transformer-based encoder and a vector quantizer. The tokenizer is trained with reconstruction, domain-adversarial, and contrastive alignment losses to align fMRI representations with the LLM’s text-embedding space.
  • Figure 3: Predictive strength of the descriptors for sex classification on UKB. Using all descriptors ("All Desc") achieves about 70% accuracy.
  • Figure 4: Overall training pipeline of fMRI-LM. (a) fMRI–text pairs are constructed from four types of features: functional connectivity, graph metrics, functional gradients, and ICA-based components. (b) Stage 1: align the fMRI tokenizer with the frozen text embedding space. (c) Stage 2: tune a pretrained LLM, using either full fine-tuning or LoRA (Hu et al., 2022), to generate linguistic or temporal representations conditioned on fMRI tokens. (d) Stage 3: multi-task, multi-paradigm instruction tuning for downstream tasks. High-level descriptions can optionally be supplied as input to improve performance.
  • Figure 5: Three paradigms for instruction tuning
  • ...and 5 more figures
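
The Figure 4(a) caption mentions building fMRI–text pairs from features such as functional connectivity. A minimal sketch of that idea, under stated assumptions, is given below: it computes a Pearson correlation matrix from ROI time series and renders the strongest connection as a sentence. The function name `fc_descriptor`, the ROI labels, the synthetic signals, and the sentence template are all hypothetical; the paper's actual descriptor format is not shown in this summary.

```python
import numpy as np

def fc_descriptor(timeseries, roi_names):
    """Turn ROI time series into a one-sentence connectivity descriptor.

    timeseries: (T, R) array with T time points and R regions of interest.
    roi_names:  list of R region labels (hypothetical labels in this example).
    """
    # Pearson correlation between every pair of ROI columns: (R, R)
    fc = np.corrcoef(timeseries, rowvar=False)
    # Look only at the upper triangle so each pair is considered once.
    rows, cols = np.triu_indices_from(fc, k=1)
    best = np.argmax(np.abs(fc[rows, cols]))
    i, j = rows[best], cols[best]
    sign = "positive" if fc[i, j] > 0 else "negative"
    return (f"Strongest functional connectivity: {roi_names[i]} and "
            f"{roi_names[j]} ({sign}, r={fc[i, j]:.2f}).")

# Synthetic signals: ROI_B is a scaled, shifted copy of ROI_A; ROI_C is out of phase.
t = np.linspace(0.0, 4.0 * np.pi, 200)
timeseries = np.stack([np.sin(t), 2.0 * np.sin(t) + 1.0, np.cos(t)], axis=1)

text = fc_descriptor(timeseries, ["ROI_A", "ROI_B", "ROI_C"])
print(text)
```

Templated sentences of this kind are one plausible way to obtain text paired with each scan when no natural captions exist; graph metrics, gradients, or ICA components could be verbalized in the same fashion.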