RealMind: Advancing Visual Decoding and Language Interaction via EEG Signals

Dongyang Li, Haoyang Qin, Mingyang Wu, Jiahua Tang, Yuang Cao, Chen Wei, Quanying Liu

TL;DR

RealMind tackles the challenge of decoding visual experiences from EEG by learning multimodal-aligned representations with semantic and geometric consistency losses. It uses a Transformer-based EEG encoder to map multi-channel EEG to latent spaces aligned with CLIP and large language models, enabling retrieval, reconstruction, and the first attempt at zero-shot EEG captioning. On the THINGS-EEG dataset, RealMind achieves a Top-1 accuracy of 27.58% and a Top-5 accuracy of 58.42% in 200-class zero-shot retrieval, and a BLEU-1 score of 26.59% in 200-class zero-shot captioning, demonstrating strong multitask performance and cross-modal alignment. This work advances practical EEG-based visual decoding by enabling captioning and by providing a scalable, interpretable architecture for BCI applications.

Abstract

Decoding visual stimuli from neural recordings is a critical challenge in the development of brain-computer interfaces (BCIs). Although recent EEG-based decoding approaches have made progress in tasks such as visual classification, retrieval, and reconstruction, they remain constrained by unstable representation learning and a lack of interpretability. This gap highlights the need for more efficient representation learning and the integration of effective language interaction to enhance both understanding and practical usability in visual decoding tasks. To address this limitation, we introduce RealMind, a novel EEG-based framework designed to handle a diverse range of downstream tasks. Specifically, RealMind leverages both semantic and geometric consistency learning to enhance feature representation and improve alignment across tasks. Notably, beyond excelling in traditional tasks, our framework marks the first attempt at visual captioning from EEG data through a vision-language model (VLM). It achieves a Top-1 decoding accuracy of 27.58% in a 200-class zero-shot retrieval task and a BLEU-1 score of 26.59% in a 200-class zero-shot captioning task. Overall, RealMind provides a comprehensive multitask EEG decoding framework, establishing a foundational approach for EEG-based visual decoding in real-world applications.
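
To make the retrieval metric concrete, the following is a minimal sketch of how Top-k accuracy in a zero-shot retrieval task is commonly computed: each EEG latent is compared with every candidate image embedding by cosine similarity, and a trial counts as correct when its matching image ranks among the k most similar candidates. The function name `topk_retrieval_accuracy`, the synthetic data, and the convention that row i of both arrays belongs to the same stimulus are assumptions made for illustration; this is not the authors' evaluation code.

```python
import numpy as np


def topk_retrieval_accuracy(eeg_latents, image_embeds, k=1):
    """Fraction of EEG trials whose matching image ranks in the top k.

    Assumption: row i of eeg_latents and row i of image_embeds
    correspond to the same stimulus.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    eeg = eeg_latents / np.linalg.norm(eeg_latents, axis=1, keepdims=True)
    img = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
    sims = eeg @ img.T  # shape: (n_trials, n_candidates)
    # Indices of the k most similar candidates for each trial.
    topk = np.argsort(-sims, axis=1)[:, :k]
    targets = np.arange(len(eeg))[:, None]
    return float((topk == targets).any(axis=1).mean())


# Synthetic stand-ins: 200 candidate embeddings of width 1024, and
# "EEG latents" built as noisy copies of them (purely illustrative).
rng = np.random.default_rng(0)
image_embeds = rng.normal(size=(200, 1024))
eeg_latents = image_embeds + 0.5 * rng.normal(size=(200, 1024))

top1 = topk_retrieval_accuracy(eeg_latents, image_embeds, k=1)
top5 = topk_retrieval_accuracy(eeg_latents, image_embeds, k=5)
print(f"Top-1: {top1:.4f}  Top-5: {top5:.4f}")
```

With 200 candidates, chance level is 0.5% for Top-1 and 2.5% for Top-5, which is the baseline against which the reported 27.58% should be read.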

Paper Structure

This paper contains 10 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Conceptual Overview. Left: The RealMind framework utilizes a multilevel representation learning strategy to align both the semantic and geometric representations of images and EEG data, incorporating constraints that enforce consistency across both modalities in terms of their underlying structure and semantics. Right: The aligned EEG representations facilitate the execution of a range of downstream decoding tasks, including retrieval, reconstruction, and caption generation, among others.
  • Figure 2: RealMind framework. Top: A Transformer-based model projects EEG signals to a latent space. Middle: The EEG latent with a shape of 1×1024 is aligned with the CLIP ViT-H-14 embeddings for retrieval and reconstruction of the corresponding image by SDXL. The EEG latent with a shape of 256×1024 is aligned with the CLIP ViT-L-14 embeddings to generate descriptive captions through a pre-trained large language model (LLM). Bottom: Brain activities that are semantically similar exhibit analogous neural patterns, and similar objects elicit comparable neural responses regardless of labels. This underscores the importance of geometric properties in compressed EEG and image representations within the feature space.
  • Figure 3: Samples of EEG-based image captions generated by the RealMind framework. We present three examples of images from subject-08. Left: From left to right, the original image is followed by three reconstructed images. Right: From top to bottom, the captions are shown for the original image (i.e., ground truth), the three reconstructed images (i.e., Lat2Rec caption), and the EEG latent representation (i.e., Latent caption).
  • Figure 4: Examples of generated answers using RealMind. Different task prompts for the same input brain signal result in unique outcomes.
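
The caption comparison in Figure 3 is summarised in the abstract by a BLEU-1 score. As a rough illustration of what that metric measures, the sketch below computes single-reference BLEU-1 as clipped unigram precision multiplied by the standard brevity penalty. The function name `bleu1`, the lowercase whitespace tokenisation, the use of a single reference, and the example captions are all assumptions; the paper's exact evaluation settings are not given in this section.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Single-reference BLEU-1: clipped unigram precision times brevity penalty.

    Assumption: lowercase whitespace tokenisation and one reference caption.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate token is credited at most as often as it occurs
    # in the reference (clipping).
    clipped = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty: candidates shorter than the reference are penalised.
    if len(cand) >= len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / len(cand))
    return bp * precision


# Hypothetical captions, for illustration only.
ground_truth = "a brown dog running on the grass"
decoded = "a dog on the grass"
print(f"BLEU-1: {bleu1(decoded, ground_truth):.4f}")
```

Because BLEU-1 only counts unigram overlap, a decoded caption can score well by naming the right objects even when word order or finer detail differs from the ground-truth caption.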