
Toward Fully-End-to-End Listened Speech Decoding from EEG Signals

Jihwan Lee, Aditya Kommineni, Tiantian Feng, Kleanthis Avramidis, Xuan Shi, Sudarsana Kadiri, Shrikanth Narayanan

TL;DR

This work addresses decoding listened speech from EEG signals by proposing FESDE, a fully end-to-end framework that reconstructs speech waveforms directly, without intermediate acoustic representations. The model comprises three parts: an EEG module (CNN+S4) that learns robust EEG representations, a VITS-based speech module that generates waveforms, and a connector with a flow that aligns the two latent spaces. Inference is a single step through the EEG encoder, the connector, and the speech decoder. Training combines a cosine-similarity objective for the EEG path with mel-spectrogram reconstruction, KL-divergence, and GAN-based losses for the speech path, and uses a gradient stop to stabilize learning. On the N400 EEG dataset, FESDE improves on a VLAAI baseline in objective metrics, and a phoneme-level analysis shows which phoneme classes are easier or harder to decode, pointing toward future extensions to imagined or phonated speech decoding.

Abstract

Speech decoding from EEG signals is a challenging task, where brain activity is modeled to estimate salient characteristics of acoustic stimuli. We propose FESDE, a novel framework for Fully-End-to-end Speech Decoding from EEG signals. Our approach aims to directly reconstruct listened speech waveforms given EEG signals, where no intermediate acoustic feature processing step is required. The proposed method consists of an EEG module and a speech module along with a connector. The EEG module learns to better represent EEG signals, while the speech module generates speech waveforms from model representations. The connector learns to bridge the distributions of the latent spaces of EEG and speech. The proposed framework is both simple and efficient, by allowing single-step inference, and outperforms prior works on objective metrics. A fine-grained phoneme analysis is conducted to unveil model characteristics of speech decoding. The source code is available here: github.com/lee-jhwn/fesde.

Paper Structure (18 sections, 4 equations, 2 figures, 3 tables)


Figures (2)

  • Figure 1: The overall schematic of the proposed method. The EEG module is trained to produce descriptive representations of EEG signals. The speech module aims to generate speech waveform from the speech embeddings. The connector converts the distribution of the EEG embedding into the speech embedding. During inference, only the EEG encoder and the speech decoder are utilized, along with the connector.
  • Figure 2: MCD (dB) and Mel-Corr (%) of each phoneme group. Lower MCD and higher Mel-Corr indicate better performance. The consonants (blue) are clustered by three criteria: manner, place, and tenseness of articulation. The vowels (red) are clustered by their tongue position and tenseness.
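
The two metrics named in Figure 2 can be illustrated with a minimal sketch. It assumes the commonly used definitions: mel-cepstral distortion (MCD) as a scaled Euclidean distance between time-aligned mel-cepstral frames, and Mel-Corr as the Pearson correlation between two mel-spectrograms, expressed in percent. The function names, the input shapes, and the choice to leave alignment and removal of the energy coefficient to the caller are assumptions made for illustration; the paper's exact evaluation code may differ.

```python
import numpy as np


def mel_cepstral_distortion(ref_mcep, est_mcep):
    """Mean MCD in dB between two time-aligned sequences of shape (frames, coeffs).

    Assumption: the caller has already aligned the frames and dropped the
    0th (energy) coefficient if that convention is wanted.
    """
    ref = np.asarray(ref_mcep, dtype=float)
    est = np.asarray(est_mcep, dtype=float)
    if ref.shape != est.shape:
        raise ValueError("inputs must have the same shape")
    # Per-frame distance: sqrt(2 * sum_d (c_d - c'_d)^2)
    per_frame = np.sqrt(2.0 * np.sum((ref - est) ** 2, axis=1))
    # Scale by 10 / ln(10) to express the result in dB, then average over frames.
    return float((10.0 / np.log(10.0)) * np.mean(per_frame))


def mel_correlation(ref_mel, est_mel):
    """Pearson correlation (in %) between two mel-spectrograms of equal shape."""
    ref = np.asarray(ref_mel, dtype=float).ravel()
    est = np.asarray(est_mel, dtype=float).ravel()
    if ref.shape != est.shape:
        raise ValueError("inputs must have the same shape")
    return float(100.0 * np.corrcoef(ref, est)[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(50, 13))                      # hypothetical features
    estimate = reference + 0.1 * rng.normal(size=(50, 13))     # noisy copy
    print("MCD (dB):", mel_cepstral_distortion(reference, estimate))
    print("Mel-Corr (%):", mel_correlation(reference, estimate))
```

Under these definitions, identical inputs give an MCD of 0 dB and a Mel-Corr of 100%, which matches the figure's reading that lower MCD and higher Mel-Corr are better.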