CognitionCapturerPro: Towards High-Fidelity Visual Decoding from EEG/MEG via Multi-modal Information and Asymmetric Alignment

Kaifan Zhang; Lihuo He; Junjie Ke; Yuqi Ji; Lukun Wu; Lizi Wang; Xinbo Gao

Abstract

Visual stimuli reconstruction from EEG remains challenging due to fidelity loss and representation shift. We propose CognitionCapturerPro, an enhanced framework that integrates EEG with multi-modal priors (images, text, depth, and edges) via collaborative training. Our core contributions include an uncertainty-weighted similarity scoring mechanism to quantify modality-specific fidelity and a fusion encoder for integrating shared representations. By employing a simplified alignment module and a pre-trained diffusion model, our method significantly outperforms the original CognitionCapturer on the THINGS-EEG dataset, improving Top-1 and Top-5 retrieval accuracy by 25.9% and 10.6%, respectively. Code is available at: https://github.com/XiaoZhangYES/CognitionCapturerPro.
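
To make the reported metric concrete, the following is a minimal sketch of how Top-k retrieval accuracy is commonly computed in this kind of setup. It assumes that each EEG embedding is compared against a pool of candidate image embeddings by cosine similarity and that the i-th EEG trial corresponds to the i-th image. The function names `cosine` and `top_k_accuracy` and the toy vectors are hypothetical and are not taken from the authors' code; the paper's exact evaluation protocol may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_accuracy(eeg_embs, img_embs, k):
    """Fraction of EEG queries whose paired image (same index) ranks in the top k.

    Assumption: eeg_embs[i] and img_embs[i] form the ground-truth pair.
    """
    hits = 0
    for i, query in enumerate(eeg_embs):
        sims = [cosine(query, cand) for cand in img_embs]
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(img_embs)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(eeg_embs)


# Toy data (illustrative only): three EEG embeddings and three image embeddings.
eeg = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.5, 0.0, 0.4]]
img = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

top1 = top_k_accuracy(eeg, img, 1)  # third query ranks its image second, so 2 of 3 hit
top2 = top_k_accuracy(eeg, img, 2)  # all three queries hit within the top 2
```

In this toy case the third EEG vector is closer to the first image than to its own, so it counts as a miss at k = 1 but as a hit at k = 2, which is why Top-5 figures are typically higher than Top-1 figures.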

Paper Structure (42 sections, 12 equations, 8 figures, 6 tables)

Introduction
Related Work
Brain Decoding
Cross-modal Contrastive Learning
Generative Visual Decoding
Method
Overview
Uncertainty-Weighted Masking (UM)
Fovea-Inspired Spatially-Variant Blurring
Uncertainty-Driven Dynamic Selection
Modality Expert Encoder
Fusion Encoder
Input Projection and Tokenization
Cross-modal Interaction
Aggregation and Robust Training
...and 27 more sections

Figures (8)

Figure 1: Core challenges in EEG-based visual decoding. We categorize the misalignment between brain signals and visual stimuli into two aspects: (Left) Representational Shift, where the brain's associative mechanisms introduce non-visual semantics that deviate from the pixel-level content; and (Right) Fidelity Loss, where selective attention and perceptual uncertainty result in incomplete or localized neural capture of the original image.
Figure 2: Overview of CognitionCapturerPro. (a) depicts the encoder training process, where multimodal data such as EEG and images are processed by independent encoders, integrated through a fusion encoder, and constrained by an improved contrastive loss. (b) depicts the SCM-Loss, which filters positive pairs using semantic labels and similarity to address one-to-many mappings. (c) depicts the STH-Align structure, which maps embeddings from various modalities into a unified image space via a shared backbone and multimodal projection heads. (d) depicts how the aligned multimodal embeddings are injected as conditions into the SDXL-Turbo model through a multi-branch IP-Adapter to reconstruct semantically consistent, high-fidelity images.
Figure 3: Qualitative comparison of image reconstruction results on the THINGS-EEG dataset. The first row displays the ground-truth visual stimuli. Subsequent rows show reconstructions generated by CognitionCapturerPro (CogCapPro), CognitionCapturer (CogCap), and ATM, respectively.
Figure 4: Qualitative comparison of reconstruction results across different modality configurations and alignment strategies. From top to bottom: the original visual stimulus, results from all modalities, single modalities, and variants without our alignment module. Our full-modality approach effectively integrates complementary features to produce the most consistent reconstructions.
Figure 5: Top-1 accuracy distribution across spatial and spectral dimensions. The top row illustrates the classification performance across different brain regions (Frontal, Temporal, Central, Parietal, and Occipital), while the bottom row shows the performance across five frequency bands (Delta, Theta, Alpha, Beta, and Gamma). The left and right columns show results for the THINGS-EEG and THINGS-MEG datasets, respectively. "All Channels" denotes the integration of all spatial or spectral information.
...and 3 more figures
