Decoding Covert Speech from EEG Using a Functional Areas Spatio-Temporal Transformer

Muyun Jiang, Yi Ding, Wei Zhang, Kok Ann Colin Teo, LaiGuan Fong, Shuailei Zhang, Zhiwei Guo, Chenyu Liu, Raghavan Bhuvanakantham, Wei Khang Jeremy Sim, Chuan Huat Vince Foo, Rong Hui Jonathan Chua, Parasuraman Padmanabhan, Victoria Leong, Jia Lu, Balazs Gulyas, Cuntai Guan

TL;DR

We develop a large-scale, multi-utterance speech EEG dataset from 57 right-handed native English-speaking subjects, each performing covert and overt speech tasks by repeating the same word five times within a ten-second trial. The study provides interpretable evidence for speech decoding from EEG.

Abstract

Covert speech involves imagining speaking without audible sound or any movements. Decoding covert speech from electroencephalogram (EEG) signals is challenging due to a limited understanding of neural pronunciation mapping and the low signal-to-noise ratio of EEG. In this study, we developed a large-scale multi-utterance speech EEG dataset from 57 right-handed native English-speaking subjects, each performing covert and overt speech tasks by repeating the same word five times within a ten-second duration. Given the spatio-temporal nature of neural activation during speech pronunciation, we developed a Functional Areas Spatio-temporal Transformer (FAST), an effective framework that converts EEG signals into tokens and uses a transformer architecture for sequence encoding. Visualizing FAST-generated activation maps across frontal and temporal brain regions reveals distinct and interpretable neural features for each covertly spoken word, providing new insights into the discriminative features of the neural representation of covert speech. This is the first report of such a study, which provides interpretable evidence for speech decoding from EEG. The code for this work is publicly available at https://github.com/Jiang-Muyun/FAST

Paper Structure

This paper contains 25 sections, 24 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of the proposed FAST. (a) The Spatial-temporal Tokenizer (ST) block illustrates the initial processing of EEG data through spatial and temporal convolutional layers. (b) The Transformer Encoder (TE) block shows the transformer architecture applied to the tokens generated by the ST, which outputs a learned CLS token used for classification.
  • Figure 2: Division of brain functional regions based on the spatial locations of EEG channels. FCz serves as the reference electrode and AFz as the ground electrode.
  • Figure 3: Protocol of the experiment. (Top): The experiment is structured into blocks. Each subject completes 10 blocks, alternating between overt and covert speech. Each block consists of 20 trials, presenting five words in a pseudo-random order. (Bottom): In each trial, the same word is displayed on the screen at predetermined onsets (T = 0, 2, 4, 6, 8 seconds) and vanishes at T + 1.5 seconds. Subjects are instructed to overtly pronounce or covertly imagine pronouncing the word five times, once following each appearance of the word. A brief resting period of random duration between 3 and 5 seconds follows each trial.
  • Figure 4: Box plots of covert speech recognition accuracy for all subjects, ordered by median accuracy: (a) accuracy from pre-trained models; (b) accuracy after fine-tuning. Significance levels for comparisons between FAST and each baseline are indicated with asterisks (*); (ns) denotes p-values > 0.05. The gray bar indicates the range of random (chance-level) performance.
  • Figure 5: Feature visualization of the ST on covert speech EEG data, averaged across all subjects. The features were extracted from the left-out subject after each round of leave-one-subject-out training. (a) A heatmap of the features generated by the ST layers, along with the corresponding time-locked features. (b) Normalized activation maps highlighting the relative activation scores across the frontal, temporal, and occipital lobes. (c) Difference activation maps showing the one-versus-all contrast for each region.
  • ...and 4 more figures
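
The trial structure described in the Figure 3 caption can be illustrated with a short schedule generator. This is a minimal sketch under stated assumptions, not the authors' stimulus code: the function names and the placeholder word labels are hypothetical, the even split of four repetitions per word in each 20-trial block is inferred from "20 trials, five words", and starting with an overt block is an arbitrary choice. Only the timing constants come from the caption.

```python
import random

# Timing constants taken from the Figure 3 caption.
CUE_ONSETS_S = (0.0, 2.0, 4.0, 6.0, 8.0)  # word appears at these times
CUE_DURATION_S = 1.5                      # word vanishes at T + 1.5 s
REST_RANGE_S = (3.0, 5.0)                 # random rest after each trial
N_BLOCKS = 10
TRIALS_PER_BLOCK = 20


def build_block_order(words, rng):
    """Return a pseudo-random word order for one block.

    Assumption: every word is shown equally often within a block.
    """
    reps = TRIALS_PER_BLOCK // len(words)
    order = list(words) * reps
    rng.shuffle(order)
    return order


def build_trial(word, rng):
    """Return cue windows (onset, offset) and the rest duration for one trial."""
    cues = [(t, t + CUE_DURATION_S) for t in CUE_ONSETS_S]
    rest_s = rng.uniform(*REST_RANGE_S)
    return {"word": word, "cues": cues, "rest_s": rest_s}


def build_session(words, seed=0):
    """Return 10 blocks that alternate between overt and covert speech.

    Assumption: the first block is overt; the caption only says the
    blocks alternate.
    """
    rng = random.Random(seed)
    session = []
    for block_idx in range(N_BLOCKS):
        mode = "overt" if block_idx % 2 == 0 else "covert"
        trials = [build_trial(w, rng) for w in build_block_order(words, rng)]
        session.append({"block": block_idx, "mode": mode, "trials": trials})
    return session


if __name__ == "__main__":
    # Placeholder labels; the actual words are not listed in this section.
    words = ["word_1", "word_2", "word_3", "word_4", "word_5"]
    session = build_session(words)
    first = session[0]["trials"][0]
    print(session[0]["mode"], first["word"], first["cues"][0])
```

Under these assumptions each subject sees every word 4 times per block and 40 times per session, split evenly between the overt and covert conditions, with five utterance windows in each trial.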