Seeing Beyond the Brain: Conditional Diffusion Model with Sparse Masked Modeling for Vision Decoding

Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, Juan Helen Zhou

TL;DR

This work tackles the challenge of reconstructing semantically faithful images from fMRI by introducing MinD-Vis, a two-stage framework that first learns rich fMRI representations via Sparse-Coded Masked Brain Modeling (SC-MBM) and then performs conditional image synthesis with a Double-Conditioned Latent Diffusion Model. The approach uses large unlabeled fMRI data for representation learning and minimal paired data for finetuning, achieving higher semantic accuracy and image quality than prior methods on the GOD and BOLD5000 datasets. Extensive ablations demonstrate the importance of SC-MBM and the double-conditioning scheme for robust brain-to-image decoding. The results underscore the potential of combining brain-inspired encoding with latent diffusion generation for advancing brain-computer interfaces and cross-domain vision understanding.

Abstract

Decoding visual stimuli from brain recordings aims to deepen our understanding of the human visual system and build a solid foundation for bridging human and computer vision through the Brain-Computer Interface. However, reconstructing high-quality images with correct semantics from brain recordings is a challenging problem due to the complex underlying representations of brain signals and the scarcity of data annotations. In this work, we present MinD-Vis: Sparse Masked Brain Modeling with Double-Conditioned Latent Diffusion Model for Human Vision Decoding. First, we learn an effective self-supervised representation of fMRI data using mask modeling in a large latent space inspired by the sparse coding of information in the primary visual cortex. Then, by augmenting a latent diffusion model with double-conditioning, we show that MinD-Vis can reconstruct highly plausible images with semantically matching details from brain recordings using very few paired annotations. We benchmarked our model qualitatively and quantitatively; the experimental results indicate that our method outperformed the state of the art in both semantic mapping (100-way semantic classification) and generation quality (FID), by 66% and 41%, respectively. An exhaustive ablation study was also conducted to analyze our framework.
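
The mask-modeling step mentioned in the abstract can be illustrated with a minimal sketch of the data preparation: split a 1D voxel vector into patches and hide a random subset of them. The mask ratio (0.75) and voxel count (4500) follow the Figure 3 caption; the patch size of 16, the zero-padding, the function name `patchify_and_mask`, and the random input are assumptions made for illustration only, and no autoencoder is trained here.

```python
import numpy as np

def patchify_and_mask(voxels, patch_size=16, mask_ratio=0.75, seed=0):
    """Split a 1D voxel vector into patches and pick a random subset to mask.

    patch_size and the zero-padding scheme are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Zero-pad so the length is a multiple of patch_size (assumption).
    pad = (-len(voxels)) % patch_size
    padded = np.pad(voxels, (0, pad))
    patches = padded.reshape(-1, patch_size)

    num_patches = patches.shape[0]
    num_masked = int(mask_ratio * num_patches)  # floor of the masked count

    # A random permutation splits patch indices into masked and visible sets.
    perm = rng.permutation(num_patches)
    masked_idx = np.sort(perm[:num_masked])
    visible_idx = np.sort(perm[num_masked:])
    return patches, visible_idx, masked_idx

# Synthetic stand-in for one fMRI sample with 4500 voxels.
voxels = np.random.default_rng(1).standard_normal(4500)
patches, visible_idx, masked_idx = patchify_and_mask(voxels)

# Only the visible patches would be passed to an encoder; a decoder
# would then be trained to recover the patches at masked_idx.
visible_patches = patches[visible_idx]
```

In a masked-autoencoder setup of this kind, the reconstruction loss is typically computed only on the masked patches, which is why the two index sets are returned separately.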

Paper Structure

This paper contains 51 sections, 3 equations, 14 figures, 8 tables, 1 algorithm.

Figures (14)

  • Figure 1: Individual Differences in Regions Responding to Visual Stimuli. Masks of the regions of interest activated during the same visual task differ in location and size across subjects. The primary visual cortex in the left (red) and right (orange) hemispheres is shown.
  • Figure 2: MinD-Vis. Stage A (left): Pre-training on fMRI with SC-MBM. We patchify the fMRI, randomly mask the patches, and tokenize them into large embeddings. We train an autoencoder ($\mathcal{E}_{MBM}$ and $\mathcal{D}_{MBM}$) to recover the masked patches. Stage B (right): Integration with the LDM through double conditioning. We project the fMRI latent ($\mathcal{L}_{fMRI}$) through two paths to the LDM conditioning space with a latent dimension projector ($\mathcal{P}_{fMRI\rightarrow Cond}$). One path connects directly to the cross-attention heads in the LDM; the other adds the fMRI latent to the time embeddings. The LDM operates on a low-dimensional, compressed version of the original image (i.e., the image latent); the original image is shown in this figure for illustration only.
  • Figure 3: Masked Brain Modeling. Mask ratio 0.75; 4500 voxels
  • Figure 4: Decoding Performance Comparisons on GOD Test Set. The ground truth, images reconstructed by MinD-Vis, and images reconstructed by three other methods are shown for comparison. MinD-Vis decoded the most accurate and plausible images with semantically similar details.
  • Figure 5: Quantitative Performance Comparisons on GOD Test Set. Performance is evaluated in terms of semantic correctness (1000-trial n-way top-k classification accuracy; the higher the better) and generation quality (FID; the lower the better).
  • ...and 9 more figures
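
The "1000-trial n-way top-k classification accuracy" named in the Figure 5 caption can be sketched as follows. This is one plausible reading of the metric, not a confirmed reproduction of the authors' protocol: it assumes a pretrained image classifier has already produced a class-probability vector for a reconstructed image, and that each trial compares the ground-truth class against n-1 randomly drawn distractor classes. The function name `n_way_top_k`, the default n of 50, and the synthetic probability vector are hypothetical.

```python
import numpy as np

def n_way_top_k(probs, true_class, n=50, k=1, num_trials=1000, seed=0):
    """Fraction of trials in which true_class ranks in the top k among
    itself and n-1 randomly sampled distractor classes.

    probs: 1D array of classifier probabilities for one reconstructed image
           (assumed to come from a pretrained classifier; hypothetical here).
    """
    rng = np.random.default_rng(seed)
    num_classes = probs.shape[0]
    others = np.delete(np.arange(num_classes), true_class)

    hits = 0
    for _ in range(num_trials):
        # Draw n-1 distinct distractor classes, then add the ground truth.
        distractors = rng.choice(others, size=n - 1, replace=False)
        candidates = np.concatenate(([true_class], distractors))
        # Rank the n candidates by predicted probability, highest first.
        order = np.argsort(probs[candidates])[::-1]
        top_k = candidates[order[:k]]
        hits += int(true_class in top_k)
    return hits / num_trials

# Synthetic probabilities over 1000 classes, strictly increasing with index.
probs = np.linspace(0.0, 1.0, 1000)
probs = probs / probs.sum()

acc_best = n_way_top_k(probs, true_class=999)   # most probable class
acc_worst = n_way_top_k(probs, true_class=0)    # least probable class
```

Averaging this score over all test images gives a single accuracy figure; restricting the comparison to n sampled classes makes the score less sensitive to the classifier's full label space than plain top-k accuracy would be.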

Theorems & Definitions (2)

  • Definition 1
  • Definition 2