BrainVis: Exploring the Bridge between Brain and Visual Signals via Image Reconstruction

Honghao Fu, Zhiqi Shen, Jing Jih Chin, Hao Wang

TL;DR

BrainVis tackles EEG-based image reconstruction by addressing noise and data limitations through a three-part architecture: a time-frequency EEG encoder using Latent Masked Modeling and FFT→LSTM, semantic interpolation that aligns EEG embeddings with a blended CLIP representation of coarse labels and BLIP-2-generated captions, and cascaded diffusion models conditioned on the aligned semantics. The method achieves superior semantic fidelity and image quality with only 10% of the training data used by prior work, outperforming state-of-the-art baselines across GA, IS, and FID. Comprehensive ablations validate the contribution of each component. By eliminating dependence on large external EEG datasets and leveraging cross-modal semantic alignment, BrainVis advances practical EEG-based visual reconstruction and provides a scalable framework for future multimodal brain–vision research. The work also analyzes its limitations and suggests that EEG signals may emphasize general visual properties, which informs directions for refining feature representations and alignment strategies.
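As a rough illustration of the frequency branch's first step, the sketch below computes a one-sided amplitude spectrum for a single toy channel. It is only a sketch under assumptions: the function name `amplitude_spectrum`, the toy sinusoid, and the choice of amplitude (rather than, say, power or phase) as the feature passed on to the LSTM are not taken from the paper, and a naive DFT stands in for an FFT routine to keep the example dependency-free.

```python
import cmath
import math


def amplitude_spectrum(signal):
    """Return the one-sided amplitude spectrum of a real-valued sequence.

    Uses a naive O(n^2) DFT so the sketch has no dependencies; an FFT
    routine would compute the same values faster.
    """
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        spectrum.append(abs(coeff) / n)
    return spectrum


# Toy single-channel trace (an assumption, not real EEG):
# a sinusoid completing 4 cycles over 32 samples.
n_samples = 32
trace = [math.sin(2 * math.pi * 4 * t / n_samples) for t in range(n_samples)]

amps = amplitude_spectrum(trace)
# Index of the frequency bin carrying the most energy.
peak_bin = max(range(len(amps)), key=amps.__getitem__)
```

In a full pipeline, one such spectrum per channel would form the sequence that a recurrent encoder consumes; that wiring is omitted here.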

Abstract

Analyzing and reconstructing visual stimuli from brain signals effectively advances the understanding of the human visual system. However, EEG signals are complex and contain significant noise. This leads to substantial limitations in existing work on reconstructing visual stimuli from EEG, such as difficulty in aligning EEG embeddings with fine-grained semantic information and a heavy reliance on additional large self-collected datasets for training. To address these challenges, we propose a novel approach called BrainVis. First, we divide the EEG signals into units and apply a self-supervised approach to them to obtain EEG time-domain features, in an attempt to ease the training difficulty. We also propose using frequency-domain features to enhance the EEG representations. Then, we simultaneously align the EEG time-frequency embeddings with an interpolation of the coarse and fine-grained semantics in the CLIP space, to highlight the primary visual components and reduce the cross-modal alignment difficulty. Finally, we adopt cascaded diffusion models to reconstruct images. Using only 10% of the training data of previous work, BrainVis outperforms the state of the art in both semantic fidelity and generation quality. The code is available at https://github.com/RomGai/BrainVis.
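The semantic interpolation step can be pictured as blending two embedding vectors into a single alignment target. The following is a minimal sketch, assuming a linear blend with a scalar `weight`; the blend form, the weight value, the unit normalization, and all function names are illustrative assumptions rather than details confirmed by the abstract, and the three-dimensional vectors are stand-ins for real CLIP embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vec]


def interpolate_semantics(coarse, fine, weight):
    """Linear blend: weight * coarse + (1 - weight) * fine (assumed form)."""
    if len(coarse) != len(fine):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * c + (1.0 - weight) * f for c, f in zip(coarse, fine)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Stand-ins for a coarse (class label) and a fine (caption) embedding.
coarse_emb = l2_normalize([1.0, 0.0, 0.0])
fine_emb = l2_normalize([0.0, 1.0, 0.0])

# Hypothetical equal weighting; the actual weighting is not given here.
target = interpolate_semantics(coarse_emb, fine_emb, weight=0.5)
sim_to_coarse = cosine_similarity(target, coarse_emb)
```

An EEG encoder would then be trained so that its output moves toward `target`, for example by maximizing cosine similarity; the blended target sits between the label and the caption, so the encoder is not asked to match every caption detail.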

Paper Structure (18 sections, 4 equations, 7 figures, 7 tables)

This paper contains 18 sections, 4 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Time series and spectrograms of EEG signals from two subjects, both derived from the same visual stimulus.
  • Figure 2: Framework of our proposed BrainVis. The blue blocks are the components included in the inference process of the pipeline. The ground truth (GT) images are only used for supervision during training.
  • Figure 3: Latent Masked Modeling (LMM) of the time branch. First, the EEG is segmented into visible and masked slices, and the masked slices are tokenized as $l_m$. Next, transformer blocks extract the visible features $f_v$, which are used to predict the masked features $f_{mp}$ and their codewords. The probability representation of the predicted codewords is $p_m$. Meanwhile, a non-trainable model, obtained as a moving average of the transformer blocks, extracts the real features $f_m$ of the masked slices. Finally, reconstruction and classification objectives are applied.
  • Figure 4: Comparison of reconstructed images’ quality.
  • Figure 5: Comparison of reconstructed images with DreamDiffusion on the same ground truth (GT) image.
  • ...and 2 more figures
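
Two mechanics named in the Figure 3 caption, splitting the EEG into visible and masked slices and maintaining a non-trainable moving-average copy of the encoder, can be sketched as below. This is an illustration under assumptions: the slice length, mask ratio, decay value, and helper names are hypothetical, the moving average is taken to be exponential, and plain lists of floats stand in for EEG samples and model weights. The transformer blocks, tokenizer, and training objectives are omitted.

```python
import random


def split_into_slices(signal, slice_len):
    """Cut a 1-D sequence into non-overlapping slices, dropping any remainder."""
    n_slices = len(signal) // slice_len
    return [signal[i * slice_len:(i + 1) * slice_len] for i in range(n_slices)]


def choose_mask(n_slices, mask_ratio, rng):
    """Return sorted indices of the slices hidden from the trainable encoder."""
    n_masked = round(n_slices * mask_ratio)
    return sorted(rng.sample(range(n_slices), n_masked))


def ema_update(target_params, online_params, decay):
    """Move the non-trainable target weights toward the trainable ones."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]


rng = random.Random(0)  # fixed seed so the split is reproducible

# Toy single-channel signal of 40 samples, cut into 5 slices of 8.
signal = [float(i) for i in range(40)]
slices = split_into_slices(signal, slice_len=8)

# Hypothetical 40% mask ratio: 2 of the 5 slices are masked.
masked_idx = choose_mask(len(slices), mask_ratio=0.4, rng=rng)
visible_idx = [i for i in range(len(slices)) if i not in masked_idx]

# One moving-average step on a toy two-parameter "model".
target_weights = ema_update([0.0, 0.0], [1.0, 2.0], decay=0.9)
```

In this picture the trainable encoder sees only the slices in `visible_idx`, while the moving-average copy encodes the slices in `masked_idx` to provide the features that the predictions are compared against.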