MindFormer: Semantic Alignment of Multi-Subject fMRI for Brain Decoding
Inhwa Han, Jaayeon Lee, Jong Chul Ye
TL;DR
MindFormer tackles the challenge of semantic alignment across subjects in fMRI-based brain decoding. It maps heterogeneous brain signals to compact, semantically meaningful embeddings using per-subject linear mappings and a learnable subject token, and it is trained to align these embeddings with the image features produced by the IP-Adapter in a $16\times768$ space. The model uses a unified transformer encoder trained with a feature-domain $L_{1}$ loss plus a contrastive term ($L_{contrastive}$) to maximize alignment with the IP-Adapter embeddings, while the learnable subject token captures subject-specific variation to reduce cross-subject bias. Demonstrations on the NSD dataset show semantically consistent image reconstructions across subjects, and the learned embeddings transfer to fMRI-to-text generation with an LLM (e.g., OPT-1.3B). Compared with prior multi-subject approaches, MindFormer achieves higher semantic fidelity with a smaller parameter footprint and better data efficiency, enabling robust decoding even with limited data.
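The TL;DR names the two loss terms but not their exact form. As a minimal sketch, assuming a CLIP-style symmetric InfoNCE term computed over flattened, L2-normalized $16\times768$ embeddings within a batch, the combined objective could look like the following. The weight `lam`, the temperature `0.07`, and all function names are hypothetical choices for illustration, not values or code from the paper.

```python
import numpy as np

def l1_loss(pred, target):
    # Feature-domain L1: mean absolute error over batch, tokens, and channels.
    return float(np.mean(np.abs(pred - target)))

def info_nce(pred, target, temperature=0.07):
    # Assumed contrastive form: flatten each (16, 768) embedding, L2-normalize,
    # and treat matching (fMRI, image) pairs in the batch as positives.
    p = pred.reshape(len(pred), -1)
    t = target.reshape(len(target), -1)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = p @ t.T / temperature  # (B, B) cosine-similarity matrix
    labels = np.arange(len(p))

    def cross_entropy(l):
        # Row-wise softmax cross-entropy; the positive pair sits on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Symmetric: fMRI-to-image and image-to-fMRI directions.
    return float(0.5 * (cross_entropy(logits) + cross_entropy(logits.T)))

def mindformer_style_loss(pred, target, lam=1.0, temperature=0.07):
    # Hypothetical weighting: L1 + lam * contrastive.
    return l1_loss(pred, target) + lam * info_nce(pred, target, temperature)

# Usage with random stand-ins for IP-Adapter targets of shape (B, 16, 768).
rng = np.random.default_rng(0)
target = rng.standard_normal((4, 16, 768))
pred = target + 0.01 * rng.standard_normal(target.shape)
print(mindformer_style_loss(pred, target))
```

The $L_{1}$ term pulls each prediction toward its own target feature by feature, while the contrastive term only cares about ranking: a prediction is rewarded for being closer to its paired image than to the other images in the batch.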
Abstract
Research on visual decoding from fMRI signals has attracted considerable attention in the research community. Still, multi-subject fMRI decoding with a single model has been considered intractable because of the drastic variations in fMRI signals between subjects, and even within the same subject across different trials. To address current limitations in multi-subject brain decoding, we introduce MindFormer, a novel semantic alignment method for multi-subject fMRI signals. The model is designed to generate fMRI-conditioned feature vectors that can condition a Stable Diffusion model for fMRI-to-image generation or a large language model (LLM) for fMRI-to-text generation. More specifically, MindFormer incorporates two key innovations: 1) a subject-specific token that effectively captures individual differences in fMRI signals while synergistically combining multi-subject fMRI data for training, and 2) a novel feature embedding and training scheme based on the IP-Adapter that extracts semantically meaningful features from fMRI signals. Our experimental results demonstrate that MindFormer generates semantically consistent images and text across different subjects. Because MindFormer maintains semantic fidelity by fully utilizing the training data across subjects, it significantly surpasses existing models in multi-subject brain decoding, which may help deepen our understanding of neural processing variations among individuals.
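To make the subject-token idea concrete, here is a scaled-down NumPy sketch. It assumes that each subject's flattened fMRI vector is linearly projected to 16 tokens of width 768, that a per-subject learnable token is prepended, and that a shared encoder processes the sequence; a single randomly initialized self-attention layer stands in for the full transformer. The voxel counts, the choice to drop the subject-token position from the output, and all names (`VOXELS`, `params`, `encode`) are hypothetical, and the parameters are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TOKENS, D_MODEL = 16, 768  # matches the 16x768 IP-Adapter feature shape

# Hypothetical voxel counts per subject (real NSD counts differ by subject).
VOXELS = {"subj01": 200, "subj02": 150}

# Per-subject parameters: a linear map from voxels to N_TOKENS x D_MODEL,
# plus one learnable subject token (randomly initialized here).
params = {
    s: {
        "W": (rng.standard_normal((v, N_TOKENS * D_MODEL)) / np.sqrt(v)).astype(np.float32),
        "subject_token": rng.standard_normal((1, D_MODEL)).astype(np.float32),
    }
    for s, v in VOXELS.items()
}

# Shared, subject-agnostic attention weights: a stand-in for the unified encoder.
Wq, Wk, Wv = (
    (rng.standard_normal((D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)).astype(np.float32)
    for _ in range(3)
)

def self_attention(x):
    # Single-head scaled dot-product attention with a residual connection.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(D_MODEL)
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return x + attn @ v

def encode(fmri, subject):
    p = params[subject]
    tokens = (fmri @ p["W"]).reshape(N_TOKENS, D_MODEL)         # subject-specific projection
    seq = np.concatenate([p["subject_token"], tokens], axis=0)  # prepend subject token
    out = self_attention(seq)                                   # shared encoder
    return out[1:]  # drop the subject-token position -> (16, 768)

x1 = rng.standard_normal(VOXELS["subj01"])
x2 = rng.standard_normal(VOXELS["subj02"])
print(encode(x1, "subj01").shape, encode(x2, "subj02").shape)
```

Only `W` and `subject_token` differ between subjects; the encoder weights are shared, which is what lets data from every subject train one model despite differing voxel counts.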
