UMind: A Unified Multitask Network for Zero-Shot M/EEG Visual Decoding
Chengjian Xu, Yonghao Song, Zelin Liao, Haochuan Zhang, Qiong Wang, Qingqing Zheng
TL;DR
UMind tackles zero-shot visual decoding from time-resolved M/EEG signals by proposing a unified multitask framework that jointly retrieves, classifies, and reconstructs visual stimuli. It achieves this through multimodal alignment of M/EEG with both images and dual-granularity text (coarse labels and fine-grained captions) and uses the resulting neural-visual and neural-semantic representations as dual conditions for a diffusion-based image reconstruction model. The approach leverages contrastive and MSE losses with frozen CLIP encoders, dual guidance via a diffusion prior and Q-Former, and SDXL-Turbo for high-quality generation, yielding state-of-the-art retrieval, classification, and reconstruction on THINGS-EEG and THINGS-MEG datasets. These findings highlight the value of integrating semantic information into neural decoding pipelines, enabling more accurate, interpretable, and semantically rich visual reconstructions for brain-computer interfaces.
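To make the alignment objective concrete, here is a minimal NumPy sketch of a symmetric contrastive (InfoNCE-style) loss combined with an MSE term between neural embeddings and frozen target embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature of 0.07, the equal weighting of the two terms, and the use of NumPy are all hypothetical choices. The summary above states only that contrastive and MSE losses are used with frozen CLIP encoders.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def alignment_loss(neural, target, temperature=0.07, mse_weight=1.0):
    """Symmetric contrastive loss plus MSE between paired embeddings.

    neural: (batch, dim) embeddings from a hypothetical M/EEG encoder.
    target: (batch, dim) embeddings from a frozen image or text encoder.
    temperature and mse_weight are assumed values, not taken from the paper.
    """
    n = l2_normalize(neural)
    t = l2_normalize(target)
    logits = n @ t.T / temperature           # (batch, batch) similarity matrix
    idx = np.arange(len(n))                  # matching pairs lie on the diagonal
    loss_n2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2n = -log_softmax(logits, axis=0)[idx, idx].mean()
    contrastive = 0.5 * (loss_n2t + loss_t2n)
    mse = np.mean((neural - target) ** 2)    # pulls embeddings together directly
    return contrastive + mse_weight * mse
```

In this sketch the contrastive term separates matching from non-matching pairs within a batch, while the MSE term pulls each neural embedding toward its target in absolute terms.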
Abstract
Decoding visual information from time-resolved brain recordings, such as EEG and MEG, plays a pivotal role in real-time brain-computer interfaces. However, existing approaches primarily focus on direct brain-image feature alignment and are limited to single-task frameworks or task-specific models. In this paper, we propose a Unified MultItask Network for zero-shot M/EEG visual Decoding (referred to as UMind), covering visual stimulus retrieval, classification, and reconstruction, where the tasks mutually enhance each other. Our method learns robust neural-visual and neural-semantic representations through multimodal alignment with both image and text modalities. Integrating both coarse and fine-grained texts enhances the extraction of these neural representations, enabling more detailed semantic and visual decoding. These representations then serve as dual conditional inputs to a pre-trained diffusion model, guiding visual reconstruction from both visual and semantic perspectives. Extensive evaluations on MEG and EEG datasets demonstrate the effectiveness, robustness, and biological plausibility of our approach in capturing spatiotemporal neural dynamics. Our approach establishes a multitask pipeline for brain visual decoding and highlights how semantic information complements visual feature extraction. The code is available at https://github.com/xuchengjian632/UMind.
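Zero-shot retrieval and classification in an aligned embedding space are commonly done by ranking candidates by cosine similarity. The sketch below illustrates that generic procedure and is not code from the UMind repository: the function names and the choice of top-k accuracy as the metric are assumptions. For retrieval the candidates would be image embeddings of unseen stimuli, and for classification they would be text embeddings of unseen class labels.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    # Row-wise normalization followed by a matrix product.
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T

def top_k_retrieval(neural_emb, candidate_emb, k=5):
    """Return indices of the k most similar candidates for each neural embedding.

    neural_emb:    (n_trials, dim) embeddings from a hypothetical M/EEG encoder.
    candidate_emb: (n_candidates, dim) image or label-text embeddings of
                   stimuli or classes not seen during training.
    """
    sims = cosine_similarity(neural_emb, candidate_emb)
    return np.argsort(-sims, axis=1)[:, :k]  # descending similarity

def top_k_accuracy(ranked, true_idx):
    # Fraction of trials whose true candidate appears among the top-k ranks.
    return float(np.mean([t in row for row, t in zip(ranked, true_idx)]))
```

A typical use would be to call `top_k_retrieval` with k=1 and k=5 and report both accuracies.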
