A Temporal-Spectral Fusion Transformer with Subject-Specific Adapter for Enhancing RSVP-BCI Decoding

Xujin Li, Wei Wei, Shuang Qiu, Huiguang He

TL;DR

This work addresses RSVP-BCI decoding when data from a new subject are scarce by proposing TSformer-SA, a Temporal-Spectral fusion Transformer with a lightweight subject-specific adapter. The model jointly processes EEG temporal signals and spectrogram images computed with the continuous wavelet transform (CWT), using a cross-view interaction module, a multi-view consistency loss, and an attention-based fusion module. A two-stage training strategy pre-trains the model on existing subjects and then fine-tunes only the adapter for each new subject. In the reported experiments, TSformer-SA outperforms conventional machine-learning, CNN-based, and Transformer baselines in both subject-dependent and subject-independent settings, while reducing preparation time and the amount of new-subject data required. Ablation studies support the contribution of each component, and further analyses indicate robustness to data variations and wavelet choices.
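To make the spectral view concrete, the following is a minimal sketch of how a single EEG channel could be turned into a CWT magnitude spectrogram. It assumes a complex Morlet wavelet with centre parameter `w0 = 6`; the summary does not state which wavelet or frequency grid the authors use, and the names `morlet` and `cwt_spectrogram` are hypothetical. It is written with the standard library only for clarity, not speed.

```python
import cmath
import math


def morlet(t, scale, w0=6.0):
    """Complex Morlet wavelet at time t (seconds), dilated by `scale` (assumed wavelet)."""
    x = t / scale
    return cmath.exp(1j * w0 * x) * math.exp(-0.5 * x * x) / math.sqrt(scale)


def cwt_spectrogram(signal, fs, freqs, w0=6.0):
    """Return |CWT| as a list of rows, one row per frequency in `freqs` (Hz)."""
    n = len(signal)
    rows = []
    for f in freqs:
        # Scale whose Morlet centre frequency equals f.
        scale = w0 / (2.0 * math.pi * f)
        # Truncate the wavelet at +/- 4 scales, where the Gaussian envelope is negligible.
        half = int(math.ceil(4.0 * scale * fs))
        kernel = [morlet(k / fs, scale, w0).conjugate() for k in range(-half, half + 1)]
        row = []
        for i in range(n):
            acc = 0j
            for j, kv in enumerate(kernel):
                idx = i + j - half  # sample aligned with kernel tap j
                if 0 <= idx < n:    # implicit zero padding at the edges
                    acc += signal[idx] * kv
            row.append(abs(acc) / fs)  # 1/fs approximates dt in the CWT integral
        rows.append(row)
    return rows


# Usage: a 10 Hz sine sampled at 100 Hz, analysed at 5, 10 and 20 Hz.
fs = 100.0
sine = [math.sin(2.0 * math.pi * 10.0 * k / fs) for k in range(200)]
spectrogram = cwt_spectrogram(sine, fs, [5.0, 10.0, 20.0])
```

In the usage example the 10 Hz row carries most of the energy away from the signal edges. In practice a vectorised library routine would replace the explicit loops.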

Abstract

The Rapid Serial Visual Presentation (RSVP)-based Brain-Computer Interface (BCI) is an efficient technology for target retrieval using electroencephalography (EEG) signals. The performance improvement of traditional decoding methods relies on a substantial amount of training data from new test subjects, which increases preparation time for BCI systems. Several studies introduce data from existing subjects to reduce the dependence of performance improvement on data from new subjects, but their optimization strategy based on adversarial learning with extensive data increases training time during the preparation procedure. Moreover, most previous methods only focus on the single-view information of EEG signals, but ignore the information from other views which may further improve performance. To enhance decoding performance while reducing preparation time, we propose a Temporal-Spectral fusion transformer with Subject-specific Adapter (TSformer-SA). Specifically, a cross-view interaction module is proposed to facilitate information transfer and extract common representations across two-view features extracted from EEG temporal signals and spectrogram images. Then, an attention-based fusion module fuses the features of two views to obtain comprehensive discriminative features for classification. Furthermore, a multi-view consistency loss is proposed to maximize the feature similarity between two views of the same EEG signal. Finally, we propose a subject-specific adapter to rapidly transfer the knowledge of the model trained on data from existing subjects to decode data from new subjects. Experimental results show that TSformer-SA significantly outperforms comparison methods and achieves outstanding performance with limited training data from new subjects. This facilitates efficient decoding and rapid deployment of BCI systems in practical use.
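The consistency loss and the fusion step described in the abstract can be illustrated with a small sketch. It assumes that feature similarity is measured by cosine similarity and that fusion weights come from a softmax over a linear score per view; the abstract specifies neither, so both choices, along with the names `consistency_loss`, `attention_fuse`, and `score_vector`, are hypothetical stand-ins rather than the authors' formulation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero


def consistency_loss(temporal_feats, spectral_feats):
    """Mean (1 - cosine similarity) over paired temporal/spectral features.

    Minimising this value maximises the similarity between the two views
    of the same EEG sample (assumed similarity measure: cosine).
    """
    sims = [cosine_similarity(t, s) for t, s in zip(temporal_feats, spectral_feats)]
    return 1.0 - sum(sims) / len(sims)


def attention_fuse(temporal_feat, spectral_feat, score_vector):
    """Fuse two view features with softmax weights from a linear score (assumed form)."""
    views = (temporal_feat, spectral_feat)
    scores = [sum(w * x for w, x in zip(score_vector, f)) for f in views]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [weights[0] * a + weights[1] * b for a, b in zip(temporal_feat, spectral_feat)]
    return fused, weights


# Usage: aligned views give a loss near 0; orthogonal views give a loss of 1.
aligned = consistency_loss([[1.0, 0.0]], [[2.0, 0.0]])
orthogonal = consistency_loss([[1.0, 0.0]], [[0.0, 1.0]])
fused, weights = attention_fuse([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])
```

In a full model the consistency term would be added to the cross-entropy classification loss, and the score vector would be learned together with the rest of the network.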

Paper Structure

This paper contains 44 sections, 14 equations, 10 figures, and 9 tables.

Figures (10)

  • Figure 1: The flowchart of the RSVP-BCI system for target retrieval. The image sequence is presented to the subject at a high rate and the EEG signals of the subject are recorded simultaneously. Subsequently, the RSVP decoding model detects EEG signals containing P300 components, and their corresponding stimulus images are the target images.
  • Figure 2: The framework of our proposed two-stage training strategy. The model is initially pre-trained on the EEG signals from existing subjects before the preparation procedure of the BCI system. During the preparation procedure, the training EEG signals from the new subject are used to fine-tune the adapter of the model. Then the model is utilized to decode EEG signals from the new subject.
  • Figure 3: Illustration of the RSVP paradigm. (a) Examples of target and nontarget images in Task plane, Task car, and Task people. The stimulus images in Task plane are sourced from the remote-sensing DIOR dataset [li2020object], those in Task car are from our self-collected drone aerial images, and those in Task people are from the scenes and objects database [torralba2009csail]. (b) Experimental settings for the division of blocks and sequences for each subject.
  • Figure 4: The structure of our proposed TSformer-SA. $F(\cdot)$ denotes the token score function, while $\mathcal{L}_{cls}$ and $\mathcal{L}_{multi\hbox{-}view}$ denote the cross-entropy loss and the multi-view consistency loss, respectively. The inputs consist of EEG temporal signals, representing the temporal view, and spectrogram images, representing the spectral view. The feature extractor tokenizes the inputs and extracts the view-specific features. Subsequently, the cross-view interaction module extracts the common features from both views, and the fusion module fuses the two-view features for classification. These three modules are trained during the pre-training stage, and only the subject-specific adapter is trained in the fine-tuning stage.
  • Figure 5: The experimental flow charts of (a) subject-dependent decoding, (b) subject-independent decoding, and (c) two-stage training.
  • ...and 5 more figures
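
The two-stage strategy in Figures 2 and 4, where the pre-trained modules stay fixed and only the adapter is updated for a new subject, can be sketched on a toy model. Everything here is an assumption for illustration: the frozen "backbone" is a fixed two-unit tanh layer, the adapter is a zero-initialised residual linear map, and the head is a fixed logistic classifier. The actual adapter architecture is not described in this summary, and all names (`BACKBONE_W`, `HEAD_C`, `fine_tune_adapter`) are hypothetical.

```python
import math

# Frozen parameters standing in for modules pre-trained on existing subjects.
BACKBONE_W = [[1.0, -0.5], [0.3, 0.8]]
HEAD_C = [0.5, -0.5]


def backbone(x):
    """Frozen feature extractor: a fixed linear map followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BACKBONE_W]


def forward(x, adapter):
    """Return backbone features and the predicted target probability."""
    h = backbone(x)
    d = len(h)
    # Residual adapter: h + A h, so a zero adapter leaves the features unchanged.
    adapted = [h[i] + sum(adapter[i][j] * h[j] for j in range(d)) for i in range(d)]
    z = sum(c * a for c, a in zip(HEAD_C, adapted))
    return h, 1.0 / (1.0 + math.exp(-z))


def mean_bce(xs, ys, adapter):
    """Mean binary cross-entropy over a small labelled set."""
    total = 0.0
    for x, y in zip(xs, ys):
        _, p = forward(x, adapter)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def fine_tune_adapter(xs, ys, steps=200, lr=0.5):
    """Gradient descent on the adapter only; backbone and head are never updated."""
    d = len(HEAD_C)
    adapter = [[0.0] * d for _ in range(d)]
    for _ in range(steps):
        grad = [[0.0] * d for _ in range(d)]
        for x, y in zip(xs, ys):
            h, p = forward(x, adapter)
            for i in range(d):
                for j in range(d):
                    # d(loss)/dA[i][j] = (p - y) * c[i] * h[j] for the logistic loss
                    grad[i][j] += (p - y) * HEAD_C[i] * h[j] / len(xs)
        for i in range(d):
            for j in range(d):
                adapter[i][j] -= lr * grad[i][j]
    return adapter


# Usage: a handful of labelled samples from a hypothetical new subject.
new_xs = [[1.0, 0.2], [0.9, -0.1], [-0.8, 0.3], [-1.0, -0.2]]
new_ys = [1, 1, 0, 0]
loss_before = mean_bce(new_xs, new_ys, [[0.0, 0.0], [0.0, 0.0]])
loss_after = mean_bce(new_xs, new_ys, fine_tune_adapter(new_xs, new_ys))
```

Because only the small adapter matrix is optimised, the number of trainable parameters during the preparation procedure is independent of the size of the frozen modules, which is the property the two-stage strategy relies on to shorten preparation time.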