LLM4Brain: Training a Large Language Model for Brain Video Understanding

Ruizhe Zheng, Lichao Sun

TL;DR

This study introduces an LLM-based approach for reconstructing visual-semantic information from fMRI signals elicited by video stimuli. An fMRI encoder equipped with adaptors is fine-tuned to transform brain responses into latent representations aligned with the video stimuli.

Abstract

Decoding visual-semantic information from brain signals, such as functional MRI (fMRI), across different subjects poses significant challenges, including low signal-to-noise ratio, limited data availability, and cross-subject variability. Recent large language models (LLMs) have shown remarkable effectiveness in processing multimodal information. In this study, we introduce an LLM-based approach for reconstructing visual-semantic information from fMRI signals elicited by video stimuli. Specifically, we fine-tune an fMRI encoder equipped with adaptors to transform brain responses into latent representations aligned with the video stimuli. These representations are then mapped to the textual modality by an LLM. In particular, we integrate self-supervised domain adaptation methods to enhance the alignment between visual-semantic information and brain responses. Our proposed method achieves good results on various quantitative semantic metrics while yielding outputs similar to the ground-truth information.

Paper Structure (9 sections, 7 equations, 2 figures, 2 tables)


Figures (2)

  • Figure 1: The overall framework of our approach for brain visual-semantic reconstruction. The fMRI is encoded by a 3D CNN tokenizer and SC-MBM. The video is encoded by ViT. The parameters of SC-MBM, ViT, and Q-Former are all frozen, but the nonlinear adaptor module is inserted into SC-MBM and Q-Former. During training, the model learns cross-subject, semantically informed fMRI latent representations through cross-modal alignment and domain adaptation, and decoding quality is improved by minimizing the difference between the video-based and fMRI-based video understanding of the instruction-tuned LLM.
  • Figure 2: The nonlinear adaptor used for fine-tuning.
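
The captions above describe frozen pretrained modules with a trainable nonlinear adaptor inserted for fine-tuning. The following is a minimal sketch of that general idea in plain Python, not the paper's implementation. The bottleneck shape, the ReLU nonlinearity, the residual connection, and the zero-initialised up-projection are all assumptions chosen for illustration, and the names `FrozenLinear`, `Adaptor`, and `adapted_layer` are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class FrozenLinear:
    """Stand-in for one frozen pretrained layer; its weights are never updated."""

    def __init__(self, dim, rng):
        scale = 1.0 / math.sqrt(dim)
        self.W = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

    def __call__(self, x):
        return matvec(self.W, x)

    def num_parameters(self):
        return sum(len(row) for row in self.W)


class Adaptor:
    """Assumed bottleneck adaptor: x + W_up * relu(W_down * x)."""

    def __init__(self, dim, bottleneck, rng):
        # Down-projection to a small hidden size (trainable).
        self.W_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection starts at zero, so the adaptor is the identity
        # before any training step (an assumption, not from the paper).
        self.W_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.W_down, x)]  # ReLU
        delta = matvec(self.W_up, hidden)
        return [xi + di for xi, di in zip(x, delta)]  # residual connection

    def num_parameters(self):
        return (sum(len(row) for row in self.W_down)
                + sum(len(row) for row in self.W_up))


def adapted_layer(frozen, adaptor, x):
    """Apply the frozen layer, then the trainable adaptor on its output."""
    return adaptor(frozen(x))


if __name__ == "__main__":
    rng = random.Random(0)
    frozen = FrozenLinear(dim=8, rng=rng)
    adaptor = Adaptor(dim=8, bottleneck=2, rng=rng)
    x = [1.0] * 8
    print("frozen parameters:   ", frozen.num_parameters())
    print("trainable parameters:", adaptor.num_parameters())
    print("identity at init:    ", adapted_layer(frozen, adaptor, x) == frozen(x))
```

In this sketch only the adaptor weights would receive gradient updates, so the number of trainable parameters stays small relative to the frozen layer, and the zero-initialised up-projection means fine-tuning starts from the pretrained behaviour.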