MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction

Zixuan Gong, Qi Zhang, Guangyin Bao, Lei Zhu, Ke Liu, Liang Hu, Duoqian Miao

TL;DR

MindTuner tackles cross-subject visual decoding from fMRI by learning subject-specific visual fingerprints and bridging fMRI to text through a Pivot module. It combines a robust multi-subject pre-training regime with lightweight, non-linear Skip-LoRAs and a trainable adaptive projector to fine-tune new subjects with minimal data, achieving state-of-the-art NSD performance for both retrieval and reconstruction at 1 hour and 40 hours of data. The approach yields meaningful neuroscience insights, showing non-linear processing concentrated in higher visual areas, and reduces data requirements for universal brain decoding with practical implications for scalable BMI/imaging applications.

Abstract

Decoding natural visual scenes from brain activity has flourished, with extensive research on single-subject tasks but far less on cross-subject tasks. Reconstructing high-quality images in cross-subject tasks is challenging because of profound individual differences between subjects and the scarcity of annotated data. In this work, we propose MindTuner for cross-subject visual decoding, which achieves high-quality, semantically rich reconstructions using only 1 hour of fMRI training data by exploiting the phenomenon of the visual fingerprint in the human visual system and a novel fMRI-to-text alignment paradigm. First, we pre-train a multi-subject model on 7 subjects and fine-tune it with scarce data on new subjects, where LoRAs with Skip-LoRAs are used to learn the visual fingerprint. Then, we take the image modality as the intermediate pivot modality to achieve fMRI-to-text alignment, which yields impressive fMRI-to-text retrieval performance and corrects fMRI-to-image reconstruction with fine-tuned semantics. Both qualitative and quantitative analyses demonstrate that MindTuner surpasses state-of-the-art cross-subject visual decoding models on the Natural Scenes Dataset (NSD), whether using 1 hour or 40 hours of training data.
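
The fine-tuning step described in the abstract builds on low-rank adaptation. As a rough illustration only, the sketch below shows a generic LoRA layer in plain Python: a frozen pre-trained weight W plus a trainable low-rank update B·A scaled by alpha / rank. It is not the authors' Skip-LoRA, whose design is not specified in this section. The class name LoRALinear, the helper matvec, and the parameters rank, alpha, and seed are all hypothetical.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


class LoRALinear:
    """Generic LoRA sketch: y = W x + (alpha / rank) * B (A x).

    W is frozen; only A and B would be updated during fine-tuning.
    This is an assumption-laden illustration, not the paper's Skip-LoRA.
    """

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(W), len(W[0])
        self.W = W  # frozen weight, shape (out_dim, in_dim)
        # A gets a small random init, shape (rank, in_dim)
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        # B starts at zero, shape (out_dim, rank), so the update is initially zero
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self):
        """Number of trainable values in A and B."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)
```

Because B is zero-initialized, the adapted layer reproduces the pre-trained output before any fine-tuning, and only rank × (in_dim + out_dim) values need to be learned for a new subject, which is what makes this family of methods attractive when data is scarce.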

Paper Structure

This paper contains 37 sections, 12 equations, 15 figures, and 8 tables.

Figures (15)

  • Figure 1: Cross-Subject Visual Decoding and Image Reconstruction. Subjects with adequate fMRI data are aligned to decode visual stimuli by learning a shared network. A new subject, even with scarce visual stimuli, is aligned to the common space of the shared network, which perceives the subject's unique visual fingerprint to ensure precise visual decoding.
  • Figure 2: Visual fingerprint experiments across subjects. 'Within' denotes the Pearson correlation coefficient of Distortion Indices in within-subject experiments, while 'between' denotes the same coefficient in between-subject experiments.
  • Figure 3: Schematic diagram of MindTuner. The training process is split into two phases, Multi-subject Pre-training and Cross-subject Fine-tuning, in which the corresponding modules are trained. The predicted embeddings are first obtained through MindTuner, and the preliminary reconstructed image is then obtained by SDXL unCLIP. The final reconstructed image is obtained by text retrieval and semantic correction with SDXL Image-Variation.
  • Figure 4: SDXL unCLIP reconstructions and SDXL Image-Variation by MindEye2's refinement or our correction.
  • Figure 5: MindTuner vs MindEye2 reconstructions from fMRI brain activity with only 1 hour of data.
  • ...and 10 more figures
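
The Figure 2 caption compares Pearson correlation coefficients within and between subjects. As a minimal sketch of that statistic only, the function below computes the Pearson correlation of two equal-length sequences in plain Python. How the Distortion Indices themselves are computed is not described in this section, so the function name pearson and any inputs passed to it are hypothetical placeholders.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of co-deviations and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Under this reading, a 'within' value would come from correlating two index vectors from the same subject and a 'between' value from correlating vectors of two different subjects. A higher within-subject correlation would be consistent with a subject-specific visual fingerprint.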