Mind's Eye: Image Recognition by EEG via Multimodal Similarity-Keeping Contrastive Learning

Chi-Sheng Chen, Chun-Shu Wei

TL;DR

This work tackles zero-shot EEG-based image recognition by proposing MUSE, a self-supervised, multimodal framework that aligns EEG and image embeddings while preserving intra-batch similarities. It combines EEG encoders (STConv/NervFormer) with an off-the-shelf CLIP-ViT image encoder and introduces a similarity-keeping loss that regularizes cross-modal contrastive learning via a trainable parameter $\beta$, yielding the combined objective $\mathcal{L}_{SK-InfoNCE} = \mathcal{L}_{InfoNCE} + \beta \times \mathcal{L}_{SK}$. Empirical results on the THINGS EEG RSVP dataset show state-of-the-art zero-shot performance (top-1 $19.3\%$, top-5 $48.8\%$) and robust improvements across variants, with interpretability analyses (Grad-CAM) illuminating occipital-parietal dynamics in the $100$–$500$ ms window and associated alpha/gamma-band activity. These findings demonstrate that brain-inspired contrastive learning can effectively bridge temporal EEG signals and visual semantics, enabling more flexible, non-invasive brain–computer interface capabilities.
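
To make the objective concrete, here is a minimal PyTorch sketch of the combined loss, assuming the standard CLIP-style symmetric InfoNCE and one plausible instantiation of the similarity-keeping term (an MSE between the two modalities' within-batch cosine-similarity matrices); the paper's exact form of $\mathcal{L}_{SK}$ may differ, and the class and argument names below are illustrative rather than taken from the official repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SKInfoNCE(nn.Module):
    """Sketch of L_SK-InfoNCE = L_InfoNCE + beta * L_SK (names illustrative)."""

    def __init__(self, temperature: float = 0.07):
        super().__init__()
        self.temperature = temperature
        # Trainable regularization weight beta, as described in the TL;DR;
        # in practice it may need a positivity constraint (e.g., softplus)
        # to avoid degenerate solutions.
        self.beta = nn.Parameter(torch.tensor(1.0))

    def forward(self, eeg_emb: torch.Tensor, img_emb: torch.Tensor) -> torch.Tensor:
        # Normalize so dot products are cosine similarities; matched
        # EEG-image pairs share a batch index (the logits' diagonal).
        eeg = F.normalize(eeg_emb, dim=-1)
        img = F.normalize(img_emb, dim=-1)

        # Symmetric InfoNCE over the cross-modal similarity matrix.
        logits = eeg @ img.t() / self.temperature
        targets = torch.arange(eeg.size(0), device=eeg.device)
        loss_infonce = 0.5 * (F.cross_entropy(logits, targets)
                              + F.cross_entropy(logits.t(), targets))

        # Similarity-keeping term: encourage the two modalities' within-batch
        # similarity structures to agree (one plausible instantiation).
        loss_sk = F.mse_loss(eeg @ eeg.t(), img @ img.t())

        return loss_infonce + self.beta * loss_sk
```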

Abstract

Decoding images from non-invasive electroencephalographic (EEG) signals has been a grand challenge in understanding how the human brain processes visual information in real-world scenarios. To cope with the issues of signal-to-noise ratio and nonstationarity, this paper introduces a MUltimodal Similarity-keeping contrastivE learning (MUSE) framework for zero-shot EEG-based image classification. We develop a series of multivariate time-series encoders tailored for EEG signals and assess the efficacy of regularized contrastive EEG-Image pretraining using an extensive visual EEG dataset. Our method achieves state-of-the-art performance, with a top-1 accuracy of 19.3% and a top-5 accuracy of 48.8% in 200-way zero-shot image classification. Furthermore, we visualize neural patterns via model interpretation, shedding light on the visual processing dynamics in the human brain. The code repository for this work is available at: https://github.com/ChiShengChen/MUSE_EEG.
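
The model interpretation mentioned above can be approximated with a simple gradient-saliency probe over the raw EEG input. This is a simplification of the paper's Grad-CAM analysis (Grad-CAM proper weights a convolutional layer's activations by their gradients), and the `eeg_encoder` interface below is an assumption for illustration, not the repository's API.

```python
import torch
import torch.nn.functional as F

def eeg_saliency(eeg_trial: torch.Tensor,      # (channels, time)
                 image_emb: torch.Tensor,      # (1, d) target image embedding
                 eeg_encoder: torch.nn.Module) -> torch.Tensor:
    """Gradient of the EEG-image alignment score w.r.t. the raw EEG trial,
    giving a per-channel, per-time-point importance map."""
    x = eeg_trial.clone().detach().requires_grad_(True)
    emb = F.normalize(eeg_encoder(x.unsqueeze(0)), dim=-1)   # (1, d)
    score = (emb * F.normalize(image_emb, dim=-1)).sum()     # cosine alignment
    score.backward()
    return x.grad.abs()                                      # (channels, time)
```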

Paper Structure

This paper contains 24 sections, 5 equations, 14 figures, 5 tables, and 1 algorithm.

Figures (14)

  • Figure 1: Schematic illustration of the proposed MUltimodal Similarity-keeping contrastivE learning (MUSE) framework. During the training phase, EEG-image pairs are independently processed by an EEG encoder and an image encoder. The objectives of the MUSE framework are twofold: 1) maximize the separation between matched and unmatched pairs, and 2) maintain the inner-batch sample similarity within each EEG-image pair (see Algorithm 1 for details). In the test phase, an unseen EEG sample is passed through the EEG encoder, which identifies the most similar image from a set of unseen images based on cross-modality embedding similarity (a minimal retrieval sketch is given after this figure list).
  • Figure 2: (a.) Overview of this work. (b.) Illustration of the feature space of the multimodal similarity-keeping contrastive learning framework (MUSE): unlike traditional contrastive learning, which focuses only on multimodal similarity, MUSE's loss function considers both multimodal similarity and inner-batch similarity. $r$ denotes a representation; $I$ and $E$ denote image and EEG signal, respectively.
  • Figure 3: Details of MUSE. (a.) The contrastive learning loss is computed from the EEG and image encodings. (b.)(c.) The similarity-keeping loss is derived from each input modality's own within-batch similarity.
  • Figure 4: Comparison of model structures. BN, IN, and LN denote batch normalization, instance normalization, and layer normalization, respectively.
  • Figure 5: Overall Top-1 zero-shot accuracy comparison of all models.
  • ...and 9 more figures
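
As noted in the Figure 1 caption, test-phase zero-shot classification reduces to nearest-neighbor retrieval in the shared embedding space. A minimal sketch, assuming trained `eeg_encoder` and `image_encoder` callables that return fixed-dimension embeddings (names and shapes illustrative, not the repository's API):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(eeg_trial: torch.Tensor,         # (channels, time)
                       candidate_images: torch.Tensor,  # (200, 3, H, W) unseen classes
                       eeg_encoder, image_encoder, k: int = 5) -> torch.Tensor:
    # Embed the unseen EEG trial and all candidate images, then rank the
    # candidates by cosine similarity to the EEG embedding.
    eeg = F.normalize(eeg_encoder(eeg_trial.unsqueeze(0)), dim=-1)   # (1, d)
    imgs = F.normalize(image_encoder(candidate_images), dim=-1)      # (200, d)
    sims = (eeg @ imgs.t()).squeeze(0)                               # (200,)
    return sims.topk(k).indices   # indices of the top-k predicted images
```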