BrainChat: Decoding Semantic Information from fMRI using Vision-language Pretrained Models

Wanaiu Huang

TL;DR

BrainChat introduces a two-stage framework that decodes semantic information from fMRI by combining Masked Brain Modeling (MBM) with a fixed, decoder-based vision-language model (CoCa). It aligns fMRI with image and text embeddings through cross-modal contrastive losses and generates text via a cross-attentive brain decoder, enabling fMRI captioning and, for the first time, fMRI question answering (fQA). The approach demonstrates state-of-the-art performance in fMRI captioning and robust fQA capability, even under data-limited conditions, highlighting potential clinical impact for augmentative and alternative communication (AAC) and human-computer interaction. The work also shows that semantic information can be decoded from brain activity without image data, broadening applicability to real-world settings with restricted data availability.

Abstract

Semantic information is vital for human interaction, and decoding it from brain activity enables non-invasive clinical augmentative and alternative communication. While there has been significant progress in reconstructing visual images, few studies have focused on the language aspect. To address this gap, leveraging the powerful capabilities of the decoder-based vision-language pretrained model CoCa, this paper proposes BrainChat, a simple yet effective generative framework aimed at rapidly accomplishing semantic information decoding tasks from brain activity, including fMRI question answering and fMRI captioning. BrainChat employs the self-supervised approach of Masked Brain Modeling to encode sparse fMRI data, obtaining a more compact embedding representation in the latent space. Subsequently, BrainChat bridges the gap between modalities by applying contrastive loss, resulting in aligned representations of fMRI, image, and text embeddings. Furthermore, the fMRI embeddings are mapped to the generative Brain Decoder via cross-attention layers, where they guide the generation of textual content about fMRI in an autoregressive manner by minimizing caption loss. Empirically, BrainChat exceeds the performance of existing state-of-the-art methods in the fMRI captioning task and, for the first time, implements fMRI question answering. Additionally, BrainChat is highly flexible and can achieve high performance without image data, making it better suited for real-world scenarios with limited data.
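
The contrastive alignment step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so this assumes a symmetric InfoNCE-style loss of the kind commonly used in CLIP-like models: matching fMRI/text (or fMRI/image) pairs in a batch are pulled together and non-matching pairs are pushed apart. The function names, the temperature value, and the plain-list embeddings are illustrative assumptions, not the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (guarding against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, 1e-12) for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(fmri_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss between two batches of embeddings.

    Row i of fmri_emb is assumed to be paired with row i of other_emb.
    The temperature of 0.07 is an assumed, commonly used default.
    """
    f = [l2_normalize(v) for v in fmri_emb]
    o = [l2_normalize(v) for v in other_emb]
    # Cosine similarity between every fMRI/other pair, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(fv, ov)) / temperature for ov in o]
              for fv in f]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the fMRI-to-other and other-to-fMRI directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy usage: correctly paired embeddings should score a lower loss
# than the same embeddings with the pairing swapped.
fmri = [[1.0, 0.0], [0.0, 1.0]]
text_paired = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(fmri, text_paired) < contrastive_loss(fmri, text_swapped))
```

In a framework of this kind, the same function would be applied once to fMRI/image pairs and once to fMRI/text pairs, and the two terms would be added to the caption loss; how the terms are weighted is not stated in this summary.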

Paper Structure

This paper contains 12 sections, 4 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Decoding semantic information from fMRI data. In the fMRI captioning task, a caption is generated based on the fMRI data, describing the semantic information observed during subject scanning. In the fMRI question answering task, corresponding answers are generated based on given questions.
  • Figure 2: (a) The BrainChat framework consists primarily of encoding and decoding parts. In the encoding part, three encoders are utilized for fMRI, image, and text, each extracting features from its respective modality. The image encoder is exclusively employed during training to enhance the quality of text generation. The decoding part comprises two decoders: the fMRI decoder and the brain decoder, employed to reconstruct masked fMRI data and generate corresponding text based on fMRI features, respectively. During training, we initially reconstruct masked fMRI data using MBM. Subsequently, we train the fMRI encoder and brain decoder using fMRI-image contrastive loss, fMRI-text contrastive loss, and caption loss. Moreover, BrainChat can perform fMRI captioning and fQA tasks without relying on image information, allowing for the removal of the image encoder. (b)(c) In the inference stage, BrainChat is utilized for fQA and fMRI captioning without the need for any visual data.
  • Figure 3: Samples of captions generated by BrainChat. On the left are the visual stimulus images, and on the right are the corresponding captions. Captions generated by BrainChat are shown in green, while the five ground truth captions are shown in black. It is evident that BrainChat can generate coherent and human-readable text solely using fMRI data. However, there are still some grammar errors (highlighted in red). Additionally, some of the generated captions are missing periods at the end.
  • Figure 4: Samples of fQA output by BrainChat. The text in green represents the answers generated by BrainChat based on the questions. Ground truth refers to all the answers provided by the VQA dataset.
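
The first training stage mentioned in the Figure 2 caption, reconstructing masked fMRI data with MBM, can be sketched as follows. This is a hedged illustration only: it assumes the fMRI signal is flattened to a 1D voxel vector, split into fixed-size patches, and masked at random, with the reconstruction error measured on the masked patches alone, as in masked-autoencoder-style training. The patch size, mask ratio, and all function names are hypothetical and are not taken from the paper.

```python
import random

def patchify(signal, patch_size):
    """Split a 1D voxel vector into equal-size patches, zero-padding the tail."""
    padded = list(signal) + [0.0] * (-len(signal) % patch_size)
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]

def random_mask(patches, mask_ratio, rng):
    """Choose which patches to hide; returns (visible_idx, masked_idx)."""
    n = len(patches)
    n_masked = int(round(n * mask_ratio))
    masked_idx = sorted(rng.sample(range(n), n_masked))
    masked_set = set(masked_idx)
    visible_idx = [i for i in range(n) if i not in masked_set]
    return visible_idx, masked_idx

def masked_mse(target_patches, predicted_patches, masked_idx):
    """Mean squared error computed over the masked patches only."""
    errs = [(t - p) ** 2
            for i in masked_idx
            for t, p in zip(target_patches[i], predicted_patches[i])]
    return sum(errs) / len(errs) if errs else 0.0

# Toy usage: 10 voxels, patches of 4, an assumed mask ratio of 0.75.
rng = random.Random(0)
patches = patchify([float(i) for i in range(10)], patch_size=4)
visible_idx, masked_idx = random_mask(patches, mask_ratio=0.75, rng=rng)
# An encoder would see only patches[i] for i in visible_idx; a decoder
# would then predict every patch. Here an all-zero prediction stands in
# for the decoder output.
prediction = [[0.0] * 4 for _ in patches]
print(len(visible_idx), len(masked_idx), masked_mse(patches, prediction, masked_idx))
```

Scoring only the masked patches forces the encoder to infer hidden activity from the visible context, which is what yields the compact latent representation the abstract refers to.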