MindSemantix: Deciphering Brain Visual Experiences with a Brain-Language Model

Ziqi Ren, Jie Li, Xuetong Xue, Xin Li, Fan Yang, Zhicheng Jiao, Xinbo Gao

TL;DR

MindSemantix addresses the challenge of decoding visual experiences from fMRI into meaningful natural language. It proposes an end-to-end Brain-Language Model that connects a pre-trained brain encoder to a frozen LLM through a Brain-Text Transformer built around a Brain Q-Former. It introduces self-supervised Brain-Encoder-Decoder (BED) pre-training to improve cross-subject generalization and trains the Brain-Language Model with a language-modeling loss on COCO captions, enabling robust brain-to-text mapping. The framework further enables downstream stimulus reconstruction by conditioning Stable Diffusion on decoded captions, yielding semantically faithful images. Overall, MindSemantix achieves state-of-the-art brain captioning and shows strong potential for caption-guided reconstruction, with ablations confirming the value of pre-training, cross-modal alignment, and caption scale. This work advances brain decoding by bringing language priors into neural decoding, with practical implications for neuroimaging-based diagnostics and brain-computer interfaces.

Abstract

Deciphering human visual experience from brain activity captured by fMRI is a compelling challenge in neuroscience research. Compared with merely predicting the viewed image itself, decoding brain activity into meaningful captions provides a higher-level interpretation and summarization of visual information, which enhances application flexibility in real-world situations. In this work, we introduce MindSemantix, a novel multi-modal framework that enables LLMs to comprehend visually-evoked semantic content in brain activity. MindSemantix explores a more ideal brain captioning paradigm by weaving LLMs into brain activity analysis, yielding a seamless, end-to-end Brain-Language Model. To effectively capture semantic information from brain responses, we propose the Brain-Text Transformer, which uses a Brain Q-Former as its core architecture. It integrates a pre-trained brain encoder with a frozen LLM to achieve multi-modal brain-vision-language alignment and establish a robust brain-language correspondence. To enhance the generalizability of neural representations, we pre-train our brain encoder on a large-scale, cross-subject fMRI dataset using self-supervised learning techniques. MindSemantix also makes downstream brain decoding tasks such as stimulus reconstruction more feasible. Conditioned on MindSemantix captions, our framework supports this process by integrating with advanced generative models like Stable Diffusion, and it excels in understanding brain visual perception. MindSemantix generates high-quality captions that are deeply rooted in the visual and semantic information derived from brain activity. This approach has demonstrated substantial quantitative improvements over prior art. Our code will be released.
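
To make the bridging idea concrete, the following is a minimal sketch of query-based cross-attention, the general pattern that the name "Brain Q-Former" suggests. It is an illustration under stated assumptions, not the authors' implementation: the sizes, the random stand-in features, and the function names are hypothetical, and the learned query/key/value projections that a real transformer layer would use are omitted for brevity.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head cross-attention without learned projections (a simplification).

    Each query vector attends over all feature vectors and returns a
    weighted average of them, so the output has one vector per query.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every feature token.
        scores = [
            sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(dim)
            for f in features
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return outputs


# Hypothetical sizes, chosen only for illustration.
DIM, NUM_BRAIN_TOKENS, NUM_QUERIES = 8, 20, 4

rng = random.Random(0)
# Stand-in for the output of a pre-trained brain encoder (random numbers here).
brain_features = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_BRAIN_TOKENS)]
# Stand-in for learnable query vectors (random here; trained in practice).
learned_queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

# A fixed-length set of vectors that could be handed to a frozen LLM as a prefix.
soft_prompt = cross_attend(learned_queries, brain_features)
```

The point of the design is that the number of output vectors depends only on the number of queries, so a variable-size brain representation is compressed into a fixed-length input for the language model while the LLM itself stays frozen.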

Paper Structure (13 sections, 2 equations, 6 figures, 4 tables)

This paper contains 13 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: MindSemantix overall schematic.
  • Figure 2: Learning procedure of MindSemantix. Top: pre-training phase of self-supervised Brain-Encoder-Decoder. Bottom: training phase of end-to-end Brain-Language Model.
  • Figure 3: Sample brain captioning results from MindSemantix and a SOTA method across visual stimulus categories, evaluated on the same test set.
  • Figure 4: MindSemantix captioning performance on Subject 1 fMRI with added Gaussian noise (mean $0$, standard deviation equal to the mean of the fMRI signal values), scaled by coefficients from 0.1 to 1.
  • Figure 5: Sample visual reconstruction and captioning results of MindSemantix on each subject.
  • ...and 1 more figure
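
The noise perturbation described in the Figure 4 caption can be sketched as follows. This is one reading of the caption, not the authors' code: it assumes the noise is drawn from a zero-mean Gaussian whose standard deviation is the mean of the signal values, and that the coefficient multiplies the noise before it is added. The function name, the toy signal, and the 0.1 step between coefficients are hypothetical.

```python
import random
import statistics


def add_scaled_gaussian_noise(signal, coefficient, rng):
    """Return a copy of `signal` with scaled zero-mean Gaussian noise added.

    Assumption: the noise standard deviation is the mean of the signal
    values (absolute value taken so it is never negative), and
    `coefficient` scales the noise amplitude.
    """
    base_std = abs(statistics.fmean(signal))
    return [x + coefficient * rng.gauss(0.0, base_std) for x in signal]


# Toy stand-in for one fMRI response vector (not real data).
toy_signal = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7]

# Coefficients from 0.1 to 1; the 0.1 step is an assumption.
coefficients = [k / 10 for k in range(1, 11)]

rng = random.Random(0)
noisy_versions = {
    c: add_scaled_gaussian_noise(toy_signal, c, rng) for c in coefficients
}
```

Each noisy version would then be passed through the captioning model and scored against the reference captions to trace how performance changes as the coefficient grows.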