Decoding Natural Images from EEG for Object Recognition

Yonghao Song, Bingchuan Liu, Xiang Li, Nanlin Shi, Yijun Wang, Xiaorong Gao

TL;DR

This work addresses decoding natural images from EEG for object recognition by introducing NICE, a self-supervised cross-modal framework that learns image representations from EEG via contrastive learning. It combines a temporal-spatial EEG encoder (TSConv) with plug-and-play spatial modules (self-attention and graph attention) and explores pre-trained image encoders for cross-modal alignment. On a challenging 200-way zero-shot task, it reaches a top-1 accuracy of $15.6\%$ and a top-5 accuracy of $42.8\%$. Extensive analyses of temporal, spatial, and spectral dynamics show that the learned EEG representations capture plausible brain activity patterns in occipital and temporal regions, supporting biological plausibility. The work also highlights practical implications for neural decoding and brain-computer interfaces and releases code to facilitate future research.

Abstract

Electroencephalography (EEG) signals, known for convenient non-invasive acquisition but low signal-to-noise ratio, have recently gained substantial attention due to the potential to decode natural images. This paper presents a self-supervised framework to demonstrate the feasibility of learning image representations from EEG signals, particularly for object recognition. The framework utilizes image and EEG encoders to extract features from paired image stimuli and EEG responses. Contrastive learning aligns these two modalities by constraining their similarity. With the framework, we attain significantly above-chance results on a comprehensive EEG-image dataset, achieving a top-1 accuracy of 15.6% and a top-5 accuracy of 42.8% in challenging 200-way zero-shot tasks. Moreover, we perform extensive experiments to explore the biological plausibility by resolving the temporal, spatial, spectral, and semantic aspects of EEG signals. Besides, we introduce attention modules to capture spatial correlations, providing implicit evidence of the brain activity perceived from EEG data. These findings yield valuable insights for neural decoding and brain-computer interfaces in real-world scenarios. The code will be released on https://github.com/eeyhsong/NICE-EEG.
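The contrastive alignment described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a symmetric, CLIP-style InfoNCE objective over a batch of paired features; the symmetric form, the fixed `temperature=0.07`, and all function names are assumptions made for illustration, not details confirmed by the text above. The features are synthetic stand-ins for encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each feature vector to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(eeg_feats, img_feats, temperature=0.07):
    # Hypothetical symmetric InfoNCE loss: row i of each array is a matched EEG-image pair.
    eeg = l2_normalize(eeg_feats)
    img = l2_normalize(img_feats)
    logits = eeg @ img.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(logits))
    # Matched pairs sit on the diagonal; every other entry acts as a negative.
    loss_eeg_to_img = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_img_to_eeg = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_eeg_to_img + loss_img_to_eeg)

# Synthetic demonstration (not real EEG or image features).
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(8, 16))
aligned_eeg = img_feats + 0.01 * rng.normal(size=(8, 16))  # nearly matched pairs
random_eeg = rng.normal(size=(8, 16))                      # unrelated features

loss_aligned = contrastive_loss(aligned_eeg, img_feats)
loss_random = contrastive_loss(random_eeg, img_feats)
```

In this sketch the loss is small when matched pairs are more similar than unmatched ones, which is the constraint on similarity that the abstract refers to.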

Paper Structure

This paper contains 30 sections, 3 equations, 7 figures, 11 tables, 1 algorithm.

Figures (7)

  • Figure 1: (A) Overall framework for EEG-based object recognition. During training, image-EEG pairs are processed by a pre-trained image encoder and an EEG encoder. The objective is to increase the similarity between matched pairs while decreasing it for unmatched pairs. During testing, a few unseen images of the target concepts (classes) are processed in advance into templates. The model then obtains results by matching test data to the templates. (B) Architecture of the EEG encoder. Temporal-spatial convolution is used with spatial modules, built from self-attention and graph attention, to reveal spatial features of brain activity. The linear layer projects the feature dimension.
  • Figure 2: Temporal, spatial, and spectral analysis. (A) Topographies for each 100 ms interval, averaged over all training trials. The temporal lobe has a clear response between 100-600 ms. (B) Accuracy averaged over all subjects for different time lengths. The region of interest is 100-600 ms, after hysteresis in visual systems. (C) Ablation of electrodes from different brain regions. The occipital, temporal, and parietal lobes contribute significantly to image decoding. (D) Time-frequency maps of occipital, temporal, and parietal lobe data from one subject. The main components are below 30 Hz, and high-frequency components can be observed on the temporal lobe. (E) Averaged accuracy of different rhythms. Theta ($\sim$4 Hz) and beta ($\sim$14-18 Hz) bands show effective performance.
  • Figure 3: Semantic similarity analysis and visualization. (A) Cosine similarity of feature pairs of 200 concepts in the test set. The results calculated by the trained models of 10 subjects were averaged, and all the concepts were rearranged into five categories: animal, food, vehicle, tool, and others. (B) Classification results visualized with ground truth (first column) and the top-5 predicted.
  • Figure 4: Effect of data size and repetition in the training and test sets, and visualization of SA and GA. (A) Accuracy with quarters of the conditions and all repetitions, and with quarters of the repetitions and all conditions, of the training images. Adding more conditions can potentially further improve the performance. (B) Accuracy with different numbers of repetitions of the test images. Averaging ten repetitions achieves an accuracy of 9.9 (30.1)%. (C) Grad-CAMs of SA and GA show activation in temporal and occipital regions.
  • Figure 5: (A) Grad-CAM for self-attention (SA) and graph attention (GA) modules of individual subjects and the average across subjects. (B) Attention weights of SA and GA with individual subjects and the average across subjects. The visualizations highlight the regions of interest, focusing on the temporal and occipital brain areas, which are known to be associated with visual processing.
  • ...and 2 more figures
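
The test-time procedure in the Figure 1 caption (matching test data to image templates) and the repetition averaging in the Figure 4 caption can be sketched as follows. This is a toy illustration under stated assumptions: the features are synthetic, the noise level and feature dimension are arbitrary, and the helper names `average_repetitions` and `top_k_accuracy` are hypothetical rather than taken from the released code. For a 200-way task, chance level is 1/200 = 0.5% for top-1 and 5/200 = 2.5% for top-5.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Unit-normalize feature vectors so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def average_repetitions(eeg_trials):
    # (n_concepts, n_reps, n_features) -> (n_concepts, n_features)
    return eeg_trials.mean(axis=1)

def top_k_accuracy(eeg_feats, templates, labels, k):
    # Rank every template by cosine similarity to each EEG feature vector
    # and count a hit when the true concept is among the k best matches.
    sims = l2_normalize(eeg_feats) @ l2_normalize(templates).T
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

# Synthetic setup (assumed values, not the paper's data).
n_concepts, n_reps, dim = 200, 10, 64
rng = np.random.default_rng(0)
templates = rng.normal(size=(n_concepts, dim))   # stand-in image templates
# Stand-in EEG features: the template signal buried in strong trial noise.
trials = templates[:, None, :] + 3.0 * rng.normal(size=(n_concepts, n_reps, dim))
labels = np.arange(n_concepts)

single_trial_top1 = top_k_accuracy(trials[:, 0, :], templates, labels, k=1)
averaged = average_repetitions(trials)
averaged_top1 = top_k_accuracy(averaged, templates, labels, k=1)
averaged_top5 = top_k_accuracy(averaged, templates, labels, k=5)

chance_top1 = 1 / n_concepts   # 0.5% for 200-way
chance_top5 = 5 / n_concepts   # 2.5% for 200-way
```

Averaging repetitions reduces trial-level noise before matching, which is why accuracy in this toy setup rises with the number of averaged repetitions; the actual gains reported in the paper depend on real EEG data and the trained encoders.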