MindFormer: Semantic Alignment of Multi-Subject fMRI for Brain Decoding

Inhwa Han, Jaayeon Lee, Jong Chul Ye

TL;DR

MindFormer tackles semantic alignment across subjects in fMRI-based brain decoding. Heterogeneous brain signals are mapped to compact, semantically meaningful embeddings through per-subject linear mappings and a learnable subject token, and the model is trained to match the image features produced by the IP-Adapter in a $16\times768$ space. A single transformer encoder is shared across subjects; training combines a feature-domain $L_{1}$ loss with a contrastive term $L_{contrastive}$ to align the outputs with the IP-Adapter embeddings, while the learnable subject token counters subject-specific bias. Experiments on the NSD dataset show semantically consistent image reconstructions across subjects and embeddings that transfer to fMRI-to-text generation with an LLM (e.g., OPT-1.3B). Compared with prior multi-subject approaches, MindFormer reports higher semantic fidelity with a smaller parameter footprint and better data efficiency, which supports decoding when per-subject data are limited.
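
The training objective described above, a feature-domain $L_{1}$ term plus a contrastive term computed against IP-Adapter embeddings, can be illustrated with a minimal NumPy sketch. The summary does not give the exact contrastive formulation or the weighting between the two terms, so the InfoNCE-style loss, the `temperature` value, and the weight `lam` below are assumptions made for illustration, not the authors' settings.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between predicted and target embeddings."""
    return float(np.mean(np.abs(pred - target)))

def info_nce(pred, target, temperature=0.1):
    """InfoNCE-style contrastive loss (assumed form, not confirmed by the source).

    Each predicted embedding should be most similar to its own target
    and dissimilar to the other targets in the batch.
    """
    p = pred.reshape(len(pred), -1)
    t = target.reshape(len(target), -1)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = p @ t.T / temperature                      # (batch, batch) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))          # positives sit on the diagonal

def combined_loss(pred, target, lam=1.0):
    """L1 + lam * contrastive; lam is a hypothetical weighting."""
    return l1_loss(pred, target) + lam * info_nce(pred, target)

rng = np.random.default_rng(0)
target = rng.normal(size=(4, 16, 768))               # stand-in for IP-Adapter features
close_pred = target + 0.1 * rng.normal(size=target.shape)
random_pred = rng.normal(size=target.shape)

loss_close = combined_loss(close_pred, target)
loss_far = combined_loss(random_pred, target)
print(loss_close < loss_far)
```

Predictions that lie near their own targets score lower on both terms than unrelated predictions, which is the behaviour the combined objective is meant to reward.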

Abstract

Visual decoding from fMRI signals has attracted considerable attention in the research community. Still, multi-subject fMRI decoding with a single model has been considered intractable because of the drastic variations in fMRI signals between subjects, and even within the same subject across different trials. To address current limitations in multi-subject brain decoding, we introduce a novel semantic alignment method for multi-subject fMRI signals, called MindFormer. The model is specifically designed to generate fMRI-conditioned feature vectors that can condition a Stable Diffusion model for fMRI-to-image generation or a large language model (LLM) for fMRI-to-text generation. More specifically, MindFormer incorporates two key innovations: 1) a subject-specific token that effectively captures individual differences in fMRI signals while synergistically combining multi-subject fMRI data for training, and 2) a novel feature embedding and training scheme based on the IP-Adapter to extract semantically meaningful features from fMRI signals. Our experimental results demonstrate that MindFormer generates semantically consistent images and text across different subjects. Because MindFormer maintains semantic fidelity by fully utilizing the training data across subjects and significantly surpasses existing models in multi-subject brain decoding, it may help deepen our understanding of neural processing variations among individuals.

Paper Structure (16 sections, 3 equations, 10 figures, 5 tables)

This paper contains 16 sections, 3 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Multi-subject brain decoding results by MindFormer. MindFormer can reconstruct semantically aligned images across subjects. Additional reconstruction samples can be found in Figure 4 and the appendix.
  • Figure 2: MindFormer architecture. The fMRI voxels recorded while the subject views the stimulus image are processed by MindFormer to extract image features. These features are then used with the Stable Diffusion model and a decoder to reconstruct the previously viewed image. MindFormer is trained to counter subject-specific bias through a learnable subject token.
  • Figure 3: Training stage: The fMRI signals from each subject are passed through a subject-specific linear layer. Each signal is then prepended with a learnable subject token and passed through the same MindFormer encoder. The network is trained to match the image feature embeddings obtained by passing the images through the IP-Adapter. Inference stage: The resulting embeddings are fed into the Stable Diffusion process as conditions. The diffusion model uses these embeddings to iteratively denoise and reconstruct the image.
  • Figure 4: Visual comparison of our proposed MindFormer with other methods. Our resulting images are semantically closest to the seen images.
  • Figure 5: Images reconstructed from human brain activity with and without learnable subject tokens (ST) in MindFormer. The model that incorporates subject tokens shows a higher semantic correlation with the seen image.
  • ...and 5 more figures
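
The training-stage data flow in the Figure 3 caption (subject-specific linear layer, prepended learnable subject token, one shared encoder, output matched to $16\times768$ IP-Adapter features) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the single attention block, the reshaping of the linear output into 16 tokens, dropping the subject token at the output, the subject names, and the small voxel counts and width used in the demo are all hypothetical choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class SharedEncoderLayer:
    """One single-head self-attention + MLP block, shared by all subjects."""
    def __init__(self, d, rng):
        s = 1.0 / np.sqrt(d)
        self.wq, self.wk, self.wv, self.wo = (
            rng.normal(scale=s, size=(d, d)) for _ in range(4))
        self.w1 = rng.normal(scale=s, size=(d, 4 * d))
        self.w2 = rng.normal(scale=1.0 / np.sqrt(4 * d), size=(4 * d, d))

    def __call__(self, x):                        # x: (batch, tokens, d)
        h = layer_norm(x)
        q, k, v = h @ self.wq, h @ self.wk, h @ self.wv
        attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1]))
        x = x + (attn @ v) @ self.wo              # residual attention
        h = layer_norm(x)
        return x + np.maximum(h @ self.w1, 0.0) @ self.w2  # residual ReLU MLP

class MindFormerSketch:
    def __init__(self, voxel_counts, n_tokens=16, d=768, seed=0):
        rng = np.random.default_rng(seed)
        self.n_tokens, self.d = n_tokens, d
        # Subject-specific linear layers absorb differing voxel counts.
        self.subject_linear = {
            s: rng.normal(scale=1.0 / np.sqrt(v), size=(v, n_tokens * d))
            for s, v in voxel_counts.items()}
        # One learnable token per subject (randomly initialised here).
        self.subject_token = {s: rng.normal(size=(1, 1, d)) for s in voxel_counts}
        self.encoder = SharedEncoderLayer(d, rng)

    def forward(self, voxels, subject):
        b = voxels.shape[0]
        tokens = (voxels @ self.subject_linear[subject]).reshape(b, self.n_tokens, self.d)
        st = np.broadcast_to(self.subject_token[subject], (b, 1, self.d))
        x = np.concatenate([st, tokens], axis=1)  # prepend the subject token
        x = self.encoder(x)                       # same encoder for every subject
        return x[:, 1:, :]                        # assumed: drop the token at output

# Tiny demo sizes (hypothetical); the summary's target space is 16 x 768.
model = MindFormerSketch({"subj01": 50, "subj02": 60}, n_tokens=16, d=32)
rng = np.random.default_rng(1)
out1 = model.forward(rng.normal(size=(2, 50)), "subj01")
out2 = model.forward(rng.normal(size=(3, 60)), "subj02")
print(out1.shape, out2.shape)
```

Both subjects produce outputs of the same token shape despite different voxel counts, which is what lets a single encoder and a single feature-matching loss be trained on pooled multi-subject data.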