Neural2Speech: A Transfer Learning Framework for Neural-Driven Speech Reconstruction

Jiawei Li, Chunxu Guo, Li Fu, Lu Fan, Edward F. Chang, Yuanning Li

TL;DR

This work tackles neural-driven speech reconstruction under data scarcity by introducing Neural2Speech, a two-phase transfer-learning framework. It combines a pre-trained speech autoencoder (a Wav2Vec2.0 encoder paired with a HiFi-GAN generator) with a lightweight, two-layer LSTM neural feature adaptor that maps neural activity to speech representations, enabling sentence-level reconstruction from as little as 20 minutes of intracranial data. The autoencoder is trained on LibriSpeech to learn rich speech representations, while the adaptor aligns ECoG signals to those representations. The system achieves a mean MSE of 0.067, an average ESTOI of 0.371 (best 0.395), and a PER of around 0.286, outperforming baselines trained from scratch. This demonstrates the practical viability of transfer-learning approaches for high-fidelity, intelligible speech reconstruction in BCIs, with significant implications for assistive communication in patients with paralysis.

Abstract

Reconstructing natural speech from neural activity is vital for enabling direct communication via brain-computer interfaces. Previous efforts have explored the conversion of neural recordings into speech using complex deep neural network (DNN) models trained on extensive neural recording data, which is resource-intensive under regular clinical constraints. However, achieving satisfactory performance in reconstructing speech from limited-scale neural recordings has been challenging, mainly due to the complexity of speech representations and the neural data constraints. To overcome these challenges, we propose a novel transfer learning framework for neural-driven speech reconstruction, called Neural2Speech, which consists of two distinct training phases. First, a speech autoencoder is pre-trained on readily available speech corpora to decode speech waveforms from the encoded speech representations. Second, a lightweight adaptor is trained on the small-scale neural recordings to align the neural activity and the speech representation for decoding. Remarkably, our proposed Neural2Speech demonstrates the feasibility of neural-driven speech reconstruction even with only 20 minutes of intracranial data, and it significantly outperforms existing baseline methods in terms of speech fidelity and intelligibility.

Paper Structure

This paper contains 14 sections, 9 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Overview of the Neural2Speech framework.
  • Figure 2: Visualization of the speech waveform and the mel-spectrogram of the raw speech and speech reconstructed from ECoG recordings using different methods for a sample sentence.
  • Figure 3: Transcription PER for individual trials on different reconstruction methods and their corresponding Kernel Density Estimation (KDE) curves.
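
The PER reported per trial in Figure 3 can be illustrated with a short sketch. It assumes PER denotes phoneme error rate in its usual sense: the edit distance between a reference phoneme sequence and the sequence transcribed from the reconstructed speech, divided by the reference length. The function names and the toy phoneme sequences below are hypothetical, and the paper's actual transcription pipeline is not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length (assumed PER definition)."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Toy example (hypothetical phonemes): one substitution and one insertion.
ref = ["HH", "AH", "L", "OW"]
hyp = ["HH", "AH", "L", "UW", "W"]
per = phoneme_error_rate(ref, hyp)
```

Under this definition a PER of 0 means the transcription matches the reference exactly, and values can exceed 1 when the hypothesis contains many insertions.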