Neural Speech Embeddings for Speech Synthesis Based on Deep Generative Networks

Seo-Hyun Lee; Young-Eun Lee; Soowon Kim; Byung-Kwan Ko; Jun-Young Kim; Seong-Whan Lee

TL;DR

The paper addresses the translation of brain activity into spoken language using non-invasive, EEG-based brain-to-speech (BTS) systems. It introduces an embedding-based pipeline that combines CSP-derived spatial filters with log-variance features; the CSP filters are trained on imagined speech and shared with spoken speech to align the two domains. It also analyzes the spatio-spectral correlates of speech production. The study describes the dataset (six participants, 64-channel EEG), the preprocessing, the feature embeddings, and an analysis framework. The analysis shows that imagined speech embeddings approximate the temporal dynamics of spoken speech and that domain adaptation brings the two representations closer in latent space. The paper further reports consistent high-gamma engagement at 90 Hz and above, along with distinct text-specific neural patterns. The work demonstrates the feasibility of neural embeddings for non-invasive BTS and outlines future directions, including larger phoneme-rich datasets and possible invasive measurements, to improve robustness and applicability for speech-impaired individuals.
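The CSP and log-variance step summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a one-vs-rest formulation of multi-class CSP, trace-normalized covariances, and equal-length time windows, and the function names (`class_covariance`, `fit_csp_one_vs_rest`, `logvar_embedding`) are hypothetical. The defaults of 8 filters per class and 16 windows follow the Figure 1 caption; with 13 classes that would give 8 × 13 = 104 rows, which matches the pattern size quoted there, although the class count is not stated in this summary.

```python
import numpy as np
from scipy.linalg import eigh


def class_covariance(trials):
    """Mean trace-normalized spatial covariance of trials shaped (n_trials, n_channels, n_samples)."""
    covs = []
    for x in trials:
        x = x - x.mean(axis=1, keepdims=True)  # remove per-channel offset
        c = x @ x.T
        covs.append(c / np.trace(c))
    return np.mean(covs, axis=0)


def fit_csp_one_vs_rest(trials, labels, n_filters=8):
    """Assumed one-vs-rest CSP. Returns filters shaped (n_classes * n_filters, n_channels)."""
    filters = []
    for c in np.unique(labels):
        cov_c = class_covariance(trials[labels == c])
        cov_rest = class_covariance(trials[labels != c])
        # Generalized eigenproblem: cov_c w = lambda (cov_c + cov_rest) w
        eigvals, eigvecs = eigh(cov_c, cov_c + cov_rest)
        order = np.argsort(eigvals)
        half = n_filters // 2
        # Keep filters from both ends of the spectrum (most discriminative directions)
        picks = np.concatenate([order[:half], order[-(n_filters - half):]])
        filters.append(eigvecs[:, picks].T)
    return np.vstack(filters)


def logvar_embedding(trial, filters, n_windows=16):
    """Project one trial (n_channels, n_samples) and return log-variance per filter and window."""
    projected = filters @ trial
    windows = np.array_split(projected, n_windows, axis=1)
    feats = [np.log(w.var(axis=1) + 1e-12) for w in windows]
    return np.stack(feats, axis=1)  # (n_filters_total, n_windows)
```

To mimic the shared-filter idea, one would call `fit_csp_one_vs_rest` on imagined-speech trials only and then pass both imagined and spoken trials through `logvar_embedding` with the same `filters`, so that both domains are expressed in the same spatial basis.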

Abstract

Brain-to-speech technology represents a fusion of interdisciplinary applications spanning artificial intelligence, brain-computer interfaces, and speech synthesis. Intention decoding and speech synthesis based on neural representation learning directly connect neural activity to the means of human linguistic communication, which may greatly enhance the naturalness of communication. With recent discoveries in representation learning and the development of speech synthesis technologies, direct translation of brain signals into speech has shown great promise. In particular, the processed input features and neural speech embeddings given to the neural network play a significant role in overall performance when deep generative models are used to generate speech from brain signals. In this paper, we introduce current brain-to-speech technology and the possibility of speech synthesis from brain signals, which may ultimately facilitate innovation in non-verbal communication. We also perform a comprehensive analysis of the neural features and neural speech embeddings underlying neurophysiological activation during speech, which may play a significant role in speech synthesis work.

Paper Structure (12 sections, 3 figures)

This paper contains 12 sections, 3 figures.

INTRODUCTION
MATERIALS AND METHODS
Dataset
Preprocessing
Feature Embeddings
Spatio-temporal Analysis
RESULTS AND DISCUSSION
Feature Embeddings
Embedding Vector Distributions
Spatio-spectral Features
Limitations and Future Works
CONCLUSION

Figures (3)

Figure 1: Feature embeddings for spoken speech and imagined speech. The feature matrix was constructed by computing the CSP patterns time-wise, at 16 time points per EEG segment. The pattern at each time point has size 104, since the multi-CSP algorithm computed 8 patterns per class. Values below the mean of each column are omitted so that the temporal variation of the embedding features is visible.
Figure 2: t-SNE plot of the features before and after the adaptation process. In the original features, the clusters of imagined speech and spoken speech are clearly separated. After adaptation, features of the same class are more widely distributed across the two domains (the blue and red samples form broad clusters).
Figure 3: Temporal-spatio-spectral analysis of the imagined speech 'thank you'. Changes in the power spectrum of (A) imagined speech EEG and (B) spoken speech EEG are plotted for each 20 Hz frequency interval at time shifts of 250 ms.
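The kind of band-wise, time-shifted power analysis described in the Figure 3 caption could be computed along the following lines. This is a rough sketch rather than the authors' analysis code: the function name `band_power_over_time` is hypothetical, the windows are assumed to be non-overlapping 250 ms segments with a Hann taper, and the sampling rate `fs` and the upper frequency limit `f_max` are assumptions, since neither is given in this summary.

```python
import numpy as np


def band_power_over_time(signal, fs, win_sec=0.25, band_hz=20.0, f_max=120.0):
    """Mean spectral power per band and window.

    signal: array shaped (n_channels, n_samples).
    Returns (power, edges), where power is shaped (n_windows, n_bands, n_channels)
    and edges holds the band boundaries in Hz.
    """
    win = int(round(win_sec * fs))            # samples per 250 ms window
    n_windows = signal.shape[1] // win        # trailing partial window is dropped
    freqs = np.fft.rfftfreq(win, d=1.0 / fs)
    edges = np.arange(0.0, f_max + band_hz, band_hz)
    taper = np.hanning(win)                   # reduce spectral leakage
    power = np.zeros((n_windows, len(edges) - 1, signal.shape[0]))
    for i in range(n_windows):
        seg = signal[:, i * win:(i + 1) * win] * taper
        spectrum = np.abs(np.fft.rfft(seg, axis=1)) ** 2
        for b in range(len(edges) - 1):
            mask = (freqs >= edges[b]) & (freqs < edges[b + 1])
            power[i, b] = spectrum[:, mask].mean(axis=1)
    return power, edges
```

Plotting `power[:, b, :]` as a scalp map for each band `b` and window would give a layout comparable to the panels in Figure 3; the high-gamma range discussed in the TL;DR corresponds to the bands whose lower edge is at or above 90 Hz, which this 20 Hz grid only approximates.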
