Integrating Language-Image Prior into EEG Decoding for Cross-Task Zero-Calibration RSVP-BCI

Xujin Li, Wei Wei, Shuang Qiu, Xinyi Zhang, Fu Li, Huiguang He

TL;DR

This work tackles cross-task zero-calibration RSVP-BCI decoding, where models trained on one RSVP task struggle to generalize to unseen tasks. It introduces ELIPformer, a Transformer-based architecture that fuses EEG with language-image priors, using a CLIP-based prompt encoder to extract the priors and a cross bi-attention mechanism to align the two modalities. The authors design three RSVP tasks and release an open dataset covering 71 subjects, on which ELIPformer outperforms conventional, CNN-based, and Transformer baselines in cross-task decoding. The results indicate effective semantic alignment between EEG and language-image features and highlight the approach's potential for rapid, practical deployment of RSVP-BCI systems in diverse scenarios.

Abstract

Rapid Serial Visual Presentation (RSVP)-based Brain-Computer Interface (BCI) is an effective technology for information detection that works by identifying Event-Related Potentials (ERPs) in electroencephalography (EEG) signals. Current RSVP decoding methods perform well when decoding EEG signals within a single RSVP task, but their performance decreases significantly when they are applied directly to a different RSVP task without calibration data from that task. This limits the rapid and efficient deployment of RSVP-BCI systems for detecting different categories of targets in various scenarios. To overcome this limitation, this study aims to improve cross-task zero-calibration RSVP decoding performance. First, we design three distinct RSVP tasks for target image retrieval and build an open-source dataset containing EEG signals and the corresponding stimulus images. Then we propose an EEG with Language-Image Prior fusion Transformer (ELIPformer) for cross-task zero-calibration RSVP decoding. Specifically, we propose a prompt encoder based on a language-image pre-trained model that extracts language-image features from task-specific prompts and stimulus images as prior knowledge for enhancing EEG decoding. A cross bidirectional attention mechanism is also adopted to facilitate effective feature fusion and alignment between the EEG and language-image features. Extensive experiments demonstrate that the proposed model achieves superior performance in cross-task zero-calibration RSVP decoding, which helps move RSVP-BCI systems from research toward practical application.

Paper Structure (38 sections, 12 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: Diagram of (a) cross-task zero-calibration decoding and (b) RSVP-EEG decoding with a language-image prior integrated into EEG decoding. (a) The RSVP decoding model is trained on EEG signals from training subjects performing existing RSVP tasks (i.e., Task plane), and can then be used directly to classify EEG signals from new subjects performing new RSVP tasks (i.e., Task car). (b) The image sequence is presented to the subject at a high rate (e.g., 10 Hz) while the subject's EEG signals are recorded. The decoding model identifies ERPs in the EEG signals to determine which of the corresponding stimulus images are targets. Language-image features are extracted from task-specific prompts and stimulus images using language-image models to enhance RSVP decoding.
  • Figure 2: The RSVP-based target image retrieval experiment. (a) Examples of target and nontarget images in the three tasks. (b) Settings of the RSVP experiment. The rest time between adjacent blocks is around 1-3 minutes. The rest time between adjacent sequences is controlled by the subjects and is around 4-6 s.
  • Figure 3: The structure of the proposed ELIPformer. (a) ELIPformer consists of the feature extractor, the prompt encoder, the cross bi-attention module, and the fusion module. The model takes raw EEG signals ($\boldsymbol{S}_{eeg}$), the corresponding stimulus images ($\boldsymbol{S}_{img}$), and task-specific prompts as input. First, the feature extractor extracts EEG features ($\boldsymbol{X}_{eeg}$) and the prompt encoder extracts language-image features ($\boldsymbol{Y}_{LI}$). Next, the cross bi-attention module enables cross-modal interaction between the extracted EEG and image tokens. Finally, the fusion module combines the output EEG and image tokens into fusion features ($\boldsymbol{x}_{f}$) for classification.
  • Figure 4: The structure of the prompt encoder. The prompt encoder consists of components from the pre-trained CLIP-ViT-B/32 model (Radford et al., 2021). Both the image encoder and text encoder are inherited from this model. Additionally, the patch embedding layer and transformer layers are derived from the image encoder in CLIP-ViT-B/32.
  • Figure 5: The structure of the cross bi-attention module. The module is composed of $N_{cross}$ successive cross bi-attention layers that enable effective interaction between EEG features ($\boldsymbol{X}_{eeg}$) and language-image features ($\boldsymbol{Y}_{LI}$).
  • ...and 6 more figures
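
The captions for Figures 3 and 5 describe a stack of $N_{cross}$ cross bi-attention layers in which EEG features ($\boldsymbol{X}_{eeg}$) and language-image features ($\boldsymbol{Y}_{LI}$) interact, followed by a fusion step that produces $\boldsymbol{x}_{f}$. The sketch below illustrates that general idea only and is not the authors' implementation. Everything beyond the caption text is an assumption made for illustration: single-head scaled dot-product attention with no learned projections, a plain residual update, mean-pool-and-concatenate fusion, and the helper names `cross_attend`, `cross_bi_attention`, and `fuse`.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Replace each query token by a softmax-weighted average of context tokens.

    Assumption: identity projections (no learned Q/K/V weights), single head.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * tok[j] for w, tok in zip(weights, context)) for j in range(dim)]
        )
    return attended


def add_residual(tokens, update):
    """Element-wise sum of two token lists with the same shape."""
    return [[t + u for t, u in zip(tr, ur)] for tr, ur in zip(tokens, update)]


def cross_bi_attention_layer(x_eeg, y_li):
    """One bidirectional step; both directions read the same layer inputs."""
    new_x = add_residual(x_eeg, cross_attend(x_eeg, y_li))  # EEG attends to language-image
    new_y = add_residual(y_li, cross_attend(y_li, x_eeg))   # language-image attends to EEG
    return new_x, new_y


def cross_bi_attention(x_eeg, y_li, n_cross=2):
    """Apply n_cross successive layers (n_cross plays the role of N_cross)."""
    for _ in range(n_cross):
        x_eeg, y_li = cross_bi_attention_layer(x_eeg, y_li)
    return x_eeg, y_li


def fuse(x_eeg, y_li):
    """Hypothetical fusion: mean-pool each token set, then concatenate."""
    def mean_pool(tokens):
        return [sum(col) / len(tokens) for col in zip(*tokens)]
    return mean_pool(x_eeg) + mean_pool(y_li)


# Toy inputs: 5 EEG tokens and 3 language-image tokens, each of width 8.
rng = random.Random(0)
x_eeg = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
y_li = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(3)]

x_out, y_out = cross_bi_attention(x_eeg, y_li, n_cross=2)
x_f = fuse(x_out, y_out)
print(len(x_out), len(y_out), len(x_f))  # prints: 5 3 16
```

Each token set keeps its own length through the layers, since attention only changes what a token contains, not how many tokens there are. A trained model would normally add learned projections, multiple heads, and normalization, none of which are specified in the captions above.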