MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction

Zixuan Gong; Qi Zhang; Guangyin Bao; Lei Zhu; Ke Liu; Liang Hu; Duoqian Miao

TL;DR

MindTuner tackles cross-subject visual decoding from fMRI by learning subject-specific visual fingerprints and bridging fMRI to text through a Pivot module. It combines a robust multi-subject pre-training regime with lightweight, non-linear Skip-LoRAs and a trainable adaptive projector to fine-tune new subjects with minimal data, achieving state-of-the-art NSD performance for both retrieval and reconstruction at 1 hour and 40 hours of data. The approach yields meaningful neuroscience insights, showing non-linear processing concentrated in higher visual areas, and reduces data requirements for universal brain decoding with practical implications for scalable BMI/imaging applications.
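The Pivot idea can be made concrete with a minimal sketch of fMRI-to-text retrieval through an image-space pivot, written in Python with NumPy. It assumes that the fMRI embeddings have already been aligned to an image-embedding space. It also assumes that the adaptive projector can be stood in for by a single linear matrix mapping that space into the text-embedding space, and that candidate captions are ranked by cosine similarity. The function names and array shapes are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-8) -> np.ndarray:
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def pivot_text_retrieval(fmri_emb: np.ndarray,
                         projector: np.ndarray,
                         text_embs: np.ndarray,
                         k: int = 1) -> np.ndarray:
    """Rank candidate captions for each fMRI embedding via an image-space pivot.

    fmri_emb:  (n, d_img) fMRI embeddings, assumed already aligned to image space
    projector: (d_img, d_txt) hypothetical linear stand-in for the adaptive projector
    text_embs: (m, d_txt) candidate caption embeddings
    Returns an (n, k) array with the indices of the top-k captions per fMRI sample.
    """
    projected = l2_normalize(fmri_emb @ projector)  # image space -> text space
    texts = l2_normalize(text_embs)
    sims = projected @ texts.T                      # (n, m) cosine similarities
    return np.argsort(-sims, axis=1)[:, :k]         # highest similarity first


if __name__ == "__main__":
    fmri = np.array([[0.1, 0.9, 0.0], [0.8, 0.1, 0.1]])
    captions = np.eye(3)  # toy one-hot caption embeddings
    print(pivot_text_retrieval(fmri, np.eye(3), captions, k=1))
```

With an identity projector and one-hot caption embeddings, each fMRI vector retrieves the caption along whose axis it points most strongly. In a trained system the projector would instead be learned so that projected fMRI embeddings land near the embeddings of their paired captions.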

Abstract

Decoding natural visual scenes from brain activity has flourished, with extensive research on single-subject tasks but less on cross-subject tasks. Reconstructing high-quality images in cross-subject tasks is challenging because of profound individual differences between subjects and the scarcity of annotated data. In this work, we propose MindTuner for cross-subject visual decoding. It achieves high-quality, semantically rich reconstructions using only 1 hour of fMRI training data by exploiting the visual fingerprint phenomenon in the human visual system together with a novel fMRI-to-text alignment paradigm. First, we pre-train a multi-subject model on 7 subjects and fine-tune it on new subjects with scarce data, using LoRAs together with Skip-LoRAs to learn each subject's visual fingerprint. Then, we use the image modality as an intermediate pivot to achieve fMRI-to-text alignment. This yields strong fMRI-to-text retrieval performance and corrects fMRI-to-image reconstructions with fine-tuned semantics. Both qualitative and quantitative analyses show that MindTuner surpasses state-of-the-art cross-subject visual decoding models on the Natural Scenes Dataset (NSD), with either 1 hour or 40 hours of training data.
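The fine-tuning step can be sketched as follows in Python with NumPy. This is a minimal illustration under stated assumptions, not the paper's exact architecture. Each frozen linear layer receives a standard LoRA update scaled by alpha / r, with B initialized to zero. The Skip-LoRA is modeled as a hypothetical low-rank branch with a ReLU that bypasses a two-layer block, carrying the block input directly to its output. The class names, rank, and alpha values are illustrative.

```python
import numpy as np


class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, W: np.ndarray, rank: int, alpha: float, rng: np.random.Generator):
        self.W = W                                          # frozen pre-trained weight, (d_out, d_in)
        d_out, d_in = W.shape
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))   # trainable down-projection
        self.B = np.zeros((d_out, rank))                    # zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:        # x: (batch, d_in)
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T


class SkipLoRA:
    """Hypothetical non-linear low-rank branch from a block's input to its output."""

    def __init__(self, d_in: int, d_out: int, rank: int, rng: np.random.Generator):
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))                    # zero init, as in LoRA

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return np.maximum(x @ self.A.T, 0.0) @ self.B.T     # ReLU makes the branch non-linear


class AdaptedMLPBlock:
    """Two frozen layers, each with a LoRA, plus a Skip-LoRA bypassing both (assumed layout)."""

    def __init__(self, W1: np.ndarray, W2: np.ndarray, rank: int = 4,
                 alpha: float = 8.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.fc1 = LoRALinear(W1, rank, alpha, rng)
        self.fc2 = LoRALinear(W2, rank, alpha, rng)
        self.skip = SkipLoRA(W1.shape[1], W2.shape[0], rank, rng)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h = np.maximum(self.fc1(x), 0.0)                    # frozen layer 1 + LoRA, then ReLU
        return self.fc2(h) + self.skip(x)                   # frozen layer 2 + LoRA, plus skip branch


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W1, W2 = rng.normal(size=(16, 8)), rng.normal(size=(8, 16))
    block = AdaptedMLPBlock(W1, W2)
    x = np.ones((3, 8))
    base = np.maximum(x @ W1.T, 0.0) @ W2.T
    print(np.allclose(block(x), base))  # adapters are no-ops at initialization
```

Because every B matrix starts at zero, the adapted block initially reproduces the pre-trained block exactly. Fine-tuning on a new subject would then update only the A and B matrices while W1 and W2 stay frozen, which keeps the number of trainable parameters small.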

Paper Structure

This paper contains 37 sections, 12 equations, 15 figures, and 8 tables.

Introduction
Related Work
Cross-Subject Functional Alignment
Text Modality in Visual Decoding
Preliminary on visual fingerprint
Method
Multi-Subject Pre-training
Multi-Subject Functional Alignment
MLP Backbone
Retrieval Submodules
Low-level and High-level Submodules
New-Subject Fine-tuning
Low-Rank Adaptation
Non-linear Skip-LoRAs
Pivot with Adaptive Projector
...and 22 more sections

Figures (15)

Figure 1: Cross-Subject Visual Decoding and Image Reconstruction. Subjects with adequate fMRI data are aligned to decode visual stimuli by learning a shared network. A new subject, even with scarce visual stimuli, is aligned to the common space of the shared network, which perceives the subject's unique visual fingerprint to ensure precise visual decoding.
Figure 2: Visual fingerprint experiments across subjects. 'Within' denotes the Pearson correlation coefficient of Distortion Indices in within-subject experiments, and 'between' denotes the same coefficient in between-subject experiments.
Figure 3: Schematic diagram of MindTuner. Training is split into two phases, Multi-subject Pre-training and Cross-subject Fine-tuning, each of which trains its corresponding modules. The predicted embeddings are first obtained through MindTuner, and a preliminary reconstructed image is then generated by SDXL unCLIP. The final reconstructed image is obtained through text retrieval and semantic correction with SDXL Image-Variation.
Figure 4: SDXL unCLIP reconstructions, and SDXL Image-Variation outputs using either MindEye2's refinement or our semantic correction.
Figure 5: MindTuner vs MindEye2 reconstructions from fMRI brain activity with only 1 hour of data.
...and 10 more figures