Enhancing Listened Speech Decoding from EEG via Parallel Phoneme Sequence Prediction

Jihwan Lee, Tiantian Feng, Aditya Kommineni, Sudarsana Reddy Kadiri, Shrikanth Narayanan

TL;DR

This work addresses decoding listened speech from EEG with a parallel three-component framework: an EEG module, a speech module, and a phoneme predictor. The model produces both listened speech waveforms and textual phoneme sequences directly from EEG embeddings. It relies on a stop-gradient connection and a combination of reconstruction, KL, and GAN objectives, with a CTC-based phoneme loss guiding sequence decoding. Empirical results on the N400 dataset show improvements over baselines in both waveform quality (lower MCD, higher Mel-Corr) and phoneme sequence decoding (higher Top-k accuracy), along with an analysis of how conformer depth affects individual phoneme groups. The approach decodes both modalities in parallel, has potential for real-time BCI applications, and points to future work on speech production decoding and on using pre-trained models for each modality.
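To make the CTC-based phoneme objective more concrete, the sketch below shows how a CTC-style output is commonly turned into a phoneme sequence: take the best label per frame, merge consecutive repeats, and drop the blank symbol. Greedy decoding, the phoneme inventory, and the blank index are assumptions chosen for illustration; the summary does not state which decoding rule the authors use at inference time.

```python
def ctc_greedy_decode(frame_scores, labels, blank=0):
    """Collapse per-frame scores into a label sequence using the CTC rule.

    frame_scores: list of per-frame score lists, one score per label.
    labels: label names indexed like the score lists (hypothetical inventory).
    blank: index of the CTC blank symbol (assumed to be 0 here).
    """
    # Best-scoring label index for every frame.
    best_path = [max(range(len(row)), key=row.__getitem__) for row in frame_scores]

    decoded = []
    previous = None
    for index in best_path:
        # Keep a label only when it differs from the previous frame
        # and is not the blank symbol.
        if index != previous and index != blank:
            decoded.append(labels[index])
        previous = index
    return decoded


def one_hot(index, size):
    """Helper for the example: a score row that peaks at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Hypothetical phoneme inventory; index 0 is the blank.
PHONEMES = ["<blank>", "HH", "AH", "L", "OW"]

# Frame-level path: HH HH _ AH L L _ L OW
path = [1, 1, 0, 2, 3, 3, 0, 3, 4]
frames = [one_hot(i, len(PHONEMES)) for i in path]
print(ctc_greedy_decode(frames, PHONEMES))
```

The blank between the two runs of `L` is what lets CTC emit the same phoneme twice in a row; without it the repeated frames would merge into a single label.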

Abstract

Brain-computer interfaces (BCI) offer numerous human-centered application possibilities, particularly for people with neurological disorders. Text or speech decoding from brain activity is a relevant domain that could improve the quality of life for people with impaired speech perception. We propose a novel approach to enhance listened speech decoding from electroencephalography (EEG) signals by utilizing an auxiliary phoneme predictor that simultaneously decodes textual phoneme sequences. The proposed model architecture consists of three main parts: an EEG module, a speech module, and a phoneme predictor. The EEG module learns to represent EEG signals as EEG embeddings. The speech module generates speech waveforms from the EEG embeddings. The phoneme predictor outputs the decoded phoneme sequences in the text modality. Our proposed approach allows users to obtain decoded listened speech from EEG signals in both modalities (speech waveforms and textual phoneme sequences) simultaneously, eliminating the need for a concatenated sequential pipeline for each modality. The proposed approach also outperforms previous methods in both modalities. The source code and speech samples are publicly available.

Paper Structure (15 sections, 4 equations, 2 figures, 3 tables)

This paper contains 15 sections, 4 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: The overall architecture of the proposed framework. The EEG module learns the EEG embeddings, which are then fed in parallel to both the phoneme predictor and the speech module to decode phoneme sequences and speech waveforms simultaneously.
  • Figure 2: (a) MCD (dB), (b) Mel-Corr (%), and (c) Top-3 accuracy (%) of each phoneme group with respect to the number of the conformer blocks. Better performance is indicated by the lower MCD, and higher Mel-Corr and top-3 accuracy. The consonants (bluish colors) are grouped by their manner, place, and voicedness of articulation. The vowels (reddish colors) are grouped by their tongue positions and tenseness.
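
The three metrics named in Figure 2 can be sketched in a few lines. The version below is a minimal illustration under stated assumptions, not the authors' evaluation code: MCD follows the standard mel-cepstral distortion formula averaged over time-aligned frames, Mel-Corr is assumed to be a Pearson correlation between flattened mel-spectrogram values, and Top-k accuracy counts a prediction as correct when the target is among the k highest-scoring classes. All function names are hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, gen_frames):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients."""
    if not ref_frames or len(ref_frames) != len(gen_frames):
        raise ValueError("inputs must be non-empty and have the same number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, gen in zip(ref_frames, gen_frames):
        squared_diff = sum((r - g) ** 2 for r, g in zip(ref, gen))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)


def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def top_k_accuracy(scores, targets, k=3):
    """Fraction of items whose target index is among the k highest scores."""
    hits = 0
    for row, target in zip(scores, targets):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Toy values, chosen only to exercise the functions.
print(mel_cepstral_distortion([[0.0, 0.0]], [[0.0, 0.0]]))
print(pearson_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(top_k_accuracy([[0.1, 0.5, 0.4], [0.7, 0.2, 0.1]], [2, 1], k=2))
```

Lower MCD and higher correlation and Top-k accuracy indicate better decoding, which matches the reading of the figure's panels.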