UMind: A Unified Multitask Network for Zero-Shot M/EEG Visual Decoding

Chengjian Xu, Yonghao Song, Zelin Liao, Haochuan Zhang, Qiong Wang, Qingqing Zheng

TL;DR

UMind tackles zero-shot visual decoding from time-resolved M/EEG signals by proposing a unified multitask framework that jointly retrieves, classifies, and reconstructs visual stimuli. It achieves this through multimodal alignment of M/EEG with both images and dual-granularity text (coarse labels and fine-grained captions) and uses the resulting neural-visual and neural-semantic representations as dual conditions for a diffusion-based image reconstruction model. The approach leverages contrastive and MSE losses with frozen CLIP encoders, dual guidance via a diffusion prior and Q-Former, and SDXL-Turbo for high-quality generation, yielding state-of-the-art retrieval, classification, and reconstruction on THINGS-EEG and THINGS-MEG datasets. These findings highlight the value of integrating semantic information into neural decoding pipelines, enabling more accurate, interpretable, and semantically rich visual reconstructions for brain-computer interfaces.
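The alignment objective mentioned above (contrastive and MSE losses against frozen CLIP embeddings) can be illustrated with a minimal NumPy sketch. The summary does not state how the two terms are weighted or what temperature is used, so `temperature`, `mse_weight`, and the function name `alignment_loss` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def alignment_loss(neural, target, temperature=0.07, mse_weight=1.0):
    """Symmetric contrastive (InfoNCE-style) loss plus MSE.

    neural: (batch, dim) embeddings from an M/EEG encoder.
    target: (batch, dim) embeddings from a frozen image or text encoder.
    temperature and mse_weight are hypothetical values, not from the paper.
    """
    n = l2_normalize(neural)
    t = l2_normalize(target)
    logits = n @ t.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(n))                  # matching pairs lie on the diagonal
    loss_n2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # neural -> target
    loss_t2n = -log_softmax(logits, axis=0)[idx, idx].mean()  # target -> neural
    contrastive = 0.5 * (loss_n2t + loss_t2n)
    mse = np.mean((neural - target) ** 2)    # pulls embeddings together directly
    return contrastive + mse_weight * mse

# Toy check: well-aligned embeddings should score lower than mismatched ones.
rng = np.random.default_rng(0)
target = rng.normal(size=(8, 16))
aligned = target + 0.01 * rng.normal(size=(8, 16))
shuffled = target[::-1].copy()               # every row paired with the wrong target
loss_aligned = alignment_loss(aligned, target)
loss_shuffled = alignment_loss(shuffled, target)
```

In this sketch the contrastive term separates matching from non-matching pairs within a batch, while the MSE term constrains the neural embedding to sit close to its target, which matters when the embedding is later used as a condition for generation.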

Abstract

Decoding visual information from time-resolved brain recordings, such as EEG and MEG, plays a pivotal role in real-time brain-computer interfaces. However, existing approaches primarily focus on direct brain-image feature alignment and are limited to single-task frameworks or task-specific models. In this paper, we propose a Unified MultItask Network for zero-shot M/EEG visual Decoding (referred to as UMind), covering visual stimulus retrieval, classification, and reconstruction, where multiple tasks mutually enhance each other. Our method learns robust neural-visual and semantic representations through multimodal alignment with both image and text modalities. The integration of both coarse and fine-grained texts enhances the extraction of these neural representations, enabling more detailed semantic and visual decoding. These representations then serve as dual conditional inputs to a pre-trained diffusion model, guiding visual reconstruction from both visual and semantic perspectives. Extensive evaluations on MEG and EEG datasets demonstrate the effectiveness, robustness, and biological plausibility of our approach in capturing spatiotemporal neural dynamics. Our approach establishes a multitask pipeline for brain visual decoding, highlighting the synergy of semantic information in visual feature extraction. The code is available at https://github.com/xuchengjian632/UMind.

Paper Structure

This paper contains 28 sections, 11 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: The proposed UMind framework enables zero-shot visual decoding from M/EEG signals, simultaneously performing visual stimulus retrieval, classification, and reconstruction. It comprises three key components: a multimodal alignment module, a visual stimulus retrieval and classification module, and a dual-conditioned diffusion reconstruction module.
  • Figure 2: Comparison between fine-grained text generated by LLaVA-1.5 7B and coarse-grained text.
  • Figure 3: Visualization of top-5 retrieval examples for the retrieval task.
  • Figure 4: Results of the temporal and spatial analysis on the THINGS-EEG dataset. (A) The average top-1 retrieval accuracy across all subjects using different EEG time windows: [0, $t$], [$t$-100, $t$], and [$t$, 1000]. (B) Retrieval performance using electrode channels from different brain areas.
  • Figure 5: We compare the images reconstructed using UMind with the ground truth images, including those that show the best, median, and worst correspondence to the original stimulus images.
  • ...and 2 more figures
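
The retrieval metrics and time windows named in the captions of Figures 3 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: it assumes retrieval ranks candidates by cosine similarity, that row i of the neural embeddings matches candidate i, that windows are given in milliseconds, and that sample 0 is stimulus onset. The function names and the `sfreq` parameter are hypothetical.

```python
import numpy as np

def top_k_accuracy(neural, candidates, k=5):
    """Fraction of trials whose true candidate is among the k most similar.

    neural:     (n_trials, dim) embeddings decoded from M/EEG.
    candidates: (n_trials, dim) image embeddings; row i is the match for trial i.
    """
    n = neural / np.linalg.norm(neural, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = n @ c.T                                # cosine similarity matrix
    top_k = np.argsort(-sims, axis=1)[:, :k]      # indices of the k best candidates
    truth = np.arange(len(neural))[:, None]
    return float((top_k == truth).any(axis=1).mean())

def time_window(eeg, start_ms, end_ms, sfreq):
    """Slice (..., n_samples) data to [start_ms, end_ms), assuming sample 0 is onset."""
    lo = int(round(start_ms * sfreq / 1000))
    hi = int(round(end_ms * sfreq / 1000))
    return eeg[..., lo:hi]

# Toy usage with hypothetical shapes: 63 channels, 1000 ms at 250 Hz.
eeg = np.zeros((63, 250))
early = time_window(eeg, 0, 200, sfreq=250)       # a [0, t] window with t = 200
late = time_window(eeg, 200, 1000, sfreq=250)     # a [t, 1000] window

candidates = np.eye(6)
perfect = top_k_accuracy(candidates * 2.0, candidates, k=1)
shifted = np.roll(candidates, 1, axis=0)          # every trial points at a wrong item
wrong_top1 = top_k_accuracy(shifted, candidates, k=1)
```

Sweeping `t` and recomputing top-1 accuracy on each window would give a curve of the kind Figure 4(A) describes; restricting the channel axis instead would correspond to the spatial analysis in Figure 4(B).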