Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication

Deok-Seon Kim, Seo-Hyun Lee, Kang Yin, Seong-Whan Lee

TL;DR

This work advances open-vocabulary neural communication by reconstructing unconstrained sentences from non-invasive biosignals, primarily high-density EEG with optional EMG. The authors introduce a subject-specific framework that outputs sentence-level MFCCs and phoneme sequences, leveraging a multimodal input, a ConvBlock–Bi-GRU architecture, and a HiFi-GAN vocoder with DeepSpeech for evaluation. They show that combining EEG and EMG significantly improves phoneme decoding and speech intelligibility for unseen sentences, with notable gains in overt and whispered speech and meaningful, albeit lower, performance for imagined speech. Neurophysiological analyses reveal frequency- and region-specific patterns across speech modes, highlighting delta rhythms as a temporal scaffold for speech, frontal involvement in imagined speech, and sustained temporal activation across modalities. These findings pave the way for adaptive, non-invasive BTS systems capable of supporting open-vocabulary communication and rehabilitation across diverse patient needs, while pointing to future work in robust imagined-speech decoding and larger, more varied datasets.

Abstract

Brain-to-speech (BTS) systems represent a groundbreaking approach to human communication by enabling the direct transformation of neural activity into linguistic expressions. While recent non-invasive BTS studies have largely focused on decoding predefined words or sentences, achieving open-vocabulary neural communication comparable to natural human interaction requires decoding unconstrained speech. Additionally, effectively integrating diverse signals derived from speech is crucial for developing personalized and adaptive neural communication and rehabilitation solutions for patients. This study investigates the potential of speech synthesis for previously unseen sentences across various speech modes by leveraging phoneme-level information extracted from high-density electroencephalography (EEG) signals, both independently and in conjunction with electromyography (EMG) signals. Furthermore, we examine the properties affecting phoneme decoding accuracy during sentence reconstruction and offer neurophysiological insights to further enhance EEG decoding for more effective neural communication solutions. Our findings underscore the feasibility of biosignal-based sentence-level speech synthesis for reconstructing unseen sentences, highlighting a significant step toward developing open-vocabulary neural communication systems adapted to diverse patient needs and conditions. Additionally, this study provides meaningful insights into the development of communication and rehabilitation solutions utilizing EEG-based decoding technologies.

Paper Structure

This paper contains 26 sections, 6 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Experimental paradigm with three different speech modes: overt, whispered, and imagined speech. EEG, EMG, and audio signals of 474 different sentences including various phonemes were recorded.
  • Figure 2: The overall decoding framework. The input is EEG data, alone or combined with EMG data, recorded during different speech modes (overt, whispered, and imagined speech). The model comprises three ConvBlocks followed by a Bi-GRU and generates the reconstructed MFCC and phoneme sequences. A pre-trained vocoder ($v$) synthesizes audio from the predicted MFCC, and a pre-trained ASR model ($A$) converts the reconstructed voice into text.
  • Figure 3: Results of unseen sentence reconstruction. (a) Phoneme accuracy of the predicted phoneme sequence, (b) RMSE between the target and reconstructed MFCC, (c) MCD between the original and generated MFCC, (d) F1-score between the ground-truth (GT) and predicted phoneme sequences. The results are averaged across all 15 participants. The Wilcoxon signed-rank test, a non-parametric statistical method, was performed to evaluate the significance of differences in paired data. Significance at the level of $p < 0.0001$ is indicated by ***.
  • Figure 4: Mel-spectrograms and audio waveforms of the original voice and of the voices reconstructed from biosignals of overt and whispered speech, for a representative subject. The target sentence is "New life with brain-computer interface technology".
  • Figure 5: Confusion matrix of the predicted phoneme sequences, aggregated across all participants. The matrix compares the predicted phonemes with the ground truth (GT) for both overt and whispered speech modes. Red boxes highlight predefined phoneme groups.
  • ...and 3 more figures
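
The Figure 3 caption reports RMSE and MCD between target and reconstructed MFCC. As a rough illustration of what these two metrics measure, the sketch below computes them for two MFCC sequences given as lists of frames. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names `rmse` and `mcd`, the `skip_c0` option (dropping the 0th cepstral coefficient), and the assumption that both sequences are already time-aligned and of equal length are all illustrative choices. The MCD uses the commonly cited form $\frac{10}{\ln 10}\sqrt{2\sum_d (c_d - \hat{c}_d)^2}$ averaged over frames; the paper may apply a different variant or an alignment step.

```python
import math


def rmse(target, predicted):
    """Root-mean-square error over all frames and coefficients.

    Both arguments are lists of frames; each frame is a list of MFCC values.
    Assumes the two sequences are already aligned frame by frame.
    """
    total, count = 0.0, 0
    for t_frame, p_frame in zip(target, predicted):
        for t, p in zip(t_frame, p_frame):
            total += (t - p) ** 2
            count += 1
    return math.sqrt(total / count)


def mcd(target, predicted, skip_c0=True):
    """Mel-cepstral distortion in dB, averaged over frames.

    skip_c0=True drops the 0th coefficient, a common convention
    (an assumption here, not confirmed by the source).
    """
    start = 1 if skip_c0 else 0
    scale = 10.0 / math.log(10.0)
    per_frame = []
    for t_frame, p_frame in zip(target, predicted):
        squared = sum(
            (t - p) ** 2 for t, p in zip(t_frame[start:], p_frame[start:])
        )
        per_frame.append(scale * math.sqrt(2.0 * squared))
    return sum(per_frame) / len(per_frame)


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the paper.
    reference = [[0.0, 1.0, 2.0], [0.0, 1.5, 2.5]]
    estimate = [[0.0, 1.2, 1.9], [0.0, 1.4, 2.7]]
    print("RMSE:", rmse(reference, estimate))
    print("MCD (dB):", mcd(reference, estimate))
```

Lower values indicate a closer match for both metrics. RMSE is expressed in the units of the MFCC features themselves, while MCD rescales the per-frame Euclidean distance to decibels.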