Enhancing Listened Speech Decoding from EEG via Parallel Phoneme Sequence Prediction
Jihwan Lee, Tiantian Feng, Aditya Kommineni, Sudarsana Reddy Kadiri, Shrikanth Narayanan
TL;DR
This work addresses decoding listened speech from EEG with a parallel framework of three components: an EEG module, a speech module, and a phoneme predictor. The model outputs both listened speech waveforms and textual phoneme sequences directly from EEG embeddings. It leverages a stop-gradient connection and a combination of reconstruction, KL, and GAN objectives, with a CTC-based phoneme loss guiding sequence decoding. On the N400 dataset, the approach improves over baselines in both waveform quality (lower MCD, higher Mel-Corr) and phoneme sequence decoding (higher Top-k accuracy); the experiments also examine the effects of conformer depth and phoneme groups. Because both modalities are decoded in parallel, the approach has potential for real-time BCI applications, and it points to future work on decoding speech production and on using pre-trained models for each modality.
Abstract
Brain-computer interfaces (BCI) offer numerous human-centered application possibilities, particularly for people with neurological disorders. Decoding text or speech from brain activity is a relevant domain that could improve the quality of life for people with impaired speech perception. We propose a novel approach to enhance listened speech decoding from electroencephalography (EEG) signals by utilizing an auxiliary phoneme predictor that simultaneously decodes textual phoneme sequences. The proposed model architecture consists of three main parts: an EEG module, a speech module, and a phoneme predictor. The EEG module learns to encode EEG signals as EEG embeddings. The speech module generates speech waveforms from the EEG embeddings. The phoneme predictor outputs the decoded phoneme sequences in the text modality. Our approach lets users obtain decoded listened speech from EEG signals in both modalities (speech waveforms and textual phoneme sequences) simultaneously, eliminating the need for a separate sequential pipeline for each modality. The proposed approach also outperforms previous methods in both modalities. The source code and speech samples are publicly available.
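
To illustrate how a phoneme predictor trained with a CTC objective can yield a phoneme sequence, the following is a minimal sketch of greedy (best-path) CTC decoding: pick the most likely symbol per frame, collapse consecutive repeats, then drop blanks. This is not taken from the authors' released code. The function name `ctc_greedy_decode`, the toy vocabulary, the blank index, and the frame scores are all hypothetical, and the paper may use a different decoding strategy (for example, beam search).

```python
from itertools import groupby
from typing import List, Sequence

# Assumption: index 0 is the CTC blank symbol (a common convention, not confirmed by the source).
BLANK = 0


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    id_to_phoneme: Sequence[str],
    blank: int = BLANK,
) -> List[str]:
    """Best-path CTC decoding over per-frame scores (one row per time frame)."""
    # 1) Most likely symbol index at each frame.
    best_path = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]
    # 2) Collapse runs of identical consecutive symbols.
    collapsed = [symbol for symbol, _ in groupby(best_path)]
    # 3) Remove blanks; a blank between two identical symbols keeps both.
    return [id_to_phoneme[i] for i in collapsed if i != blank]


# Hypothetical toy vocabulary and frame-level scores, for illustration only.
vocab = ["<blank>", "HH", "AH", "L", "OW"]
frames = [
    [0.10, 0.80, 0.05, 0.03, 0.02],  # HH
    [0.10, 0.70, 0.10, 0.05, 0.05],  # HH (repeat, collapsed)
    [0.90, 0.02, 0.03, 0.03, 0.02],  # blank
    [0.10, 0.10, 0.70, 0.05, 0.05],  # AH
    [0.10, 0.10, 0.10, 0.60, 0.10],  # L
    [0.80, 0.05, 0.05, 0.05, 0.05],  # blank separates the two L's
    [0.10, 0.10, 0.10, 0.60, 0.10],  # L
    [0.10, 0.10, 0.10, 0.10, 0.60],  # OW
]

decoded = ctc_greedy_decode(frames, vocab)
print(decoded)
```

In this toy input the blank frame between the two `L` frames is what allows a repeated phoneme to survive the collapse step; without it, the two `L` frames would merge into one.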
