Neural2Speech: A Transfer Learning Framework for Neural-Driven Speech Reconstruction
Jiawei Li, Chunxu Guo, Li Fu, Lu Fan, Edward F. Chang, Yuanning Li
TL;DR
This work tackles neural-driven speech reconstruction under data scarcity by introducing Neural2Speech, a two-phase transfer-learning framework. It combines a pre-trained speech autoencoder (a Wav2Vec2.0 encoder paired with a HiFi-GAN generator) with a lightweight, two-layer LSTM neural feature adaptor that maps neural activity to speech representations, enabling sentence-level reconstruction from as little as 20 minutes of intracranial data. The autoencoder is trained on LibriSpeech to learn rich speech representations, while the adaptor aligns ECoG signals to those representations. The system achieves a mean $MSE$ of $0.067$, an average $ESTOI$ of $0.371$ (best $0.395$), and a $PER$ of around $0.286$, outperforming baselines trained from scratch. These results indicate that transfer learning is a practical route to high-fidelity, intelligible speech reconstruction in BCIs, which is relevant to assistive communication for patients with paralysis.
Abstract
Reconstructing natural speech from neural activity is vital for enabling direct communication via brain-computer interfaces. Previous efforts have explored the conversion of neural recordings into speech using complex deep neural network (DNN) models trained on extensive neural recording data, an approach that is resource-intensive under regular clinical constraints. However, achieving satisfactory performance in reconstructing speech from limited-scale neural recordings has been challenging, mainly due to the complexity of speech representations and the constraints on neural data. To overcome these challenges, we propose a novel transfer learning framework for neural-driven speech reconstruction, called Neural2Speech, which consists of two distinct training phases. First, a speech autoencoder is pre-trained on readily available speech corpora to decode speech waveforms from the encoded speech representations. Second, a lightweight adaptor is trained on the small-scale neural recordings to align the neural activity with the speech representation for decoding. Remarkably, Neural2Speech demonstrates the feasibility of neural-driven speech reconstruction with only 20 minutes of intracranial data, and it significantly outperforms existing baseline methods in terms of speech fidelity and intelligibility.
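
The two training phases described above can be illustrated with a deliberately small toy, written in plain Python. It is a sketch under stated assumptions, not the authors' implementation: the speech encoder and decoder are fixed linear stand-ins for the pre-trained autoencoder, the adaptor is a single linear map rather than an LSTM, the paired data are synthetic, and the alignment loss is assumed to be a plain mean squared error, which the abstract does not specify. All function and variable names are hypothetical.

```python
import random

random.seed(0)

# Phase 1 stand-in: a "pre-trained" speech autoencoder whose weights stay fixed.
# These toy maps are hypothetical placeholders for a real encoder and generator.
def speech_encoder(wave):
    return [0.5 * w for w in wave]   # waveform -> speech representation

def speech_decoder(rep):
    return [2.0 * r for r in rep]    # speech representation -> waveform

# Synthetic paired data (an assumption of this toy): each "neural" vector is a
# fixed linear mix of the representation of the speech heard at that moment.
MIX = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]  # 3 neural channels, 2 rep dims

def make_pair():
    wave = [random.uniform(-2.0, 2.0) for _ in range(2)]
    rep = speech_encoder(wave)
    neural = [sum(m * r for m, r in zip(row, rep)) for row in MIX]
    return neural, wave

pairs = [make_pair() for _ in range(64)]

def adapt(weights, neural):
    """Linear adaptor: neural activity -> predicted speech representation."""
    return [sum(w * n for w, n in zip(row, neural)) for row in weights]

def alignment_loss(weights):
    """Mean squared error between adaptor output and the frozen encoder's output."""
    total = 0.0
    for neural, wave in pairs:
        target = speech_encoder(wave)
        pred = adapt(weights, neural)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / (len(pairs) * 2)

def train_adaptor(steps=1000, lr=0.1):
    """Phase 2: full-batch gradient descent on the adaptor weights only."""
    weights = [[0.0] * 3 for _ in range(2)]
    scale = 2.0 / (len(pairs) * 2)   # derivative of the mean squared error
    for _ in range(steps):
        grads = [[0.0] * 3 for _ in range(2)]
        for neural, wave in pairs:
            target = speech_encoder(wave)
            pred = adapt(weights, neural)
            for i in range(2):
                err = pred[i] - target[i]
                for j in range(3):
                    grads[i][j] += scale * err * neural[j]
        for i in range(2):
            for j in range(3):
                weights[i][j] -= lr * grads[i][j]
    return weights

initial_loss = alignment_loss([[0.0] * 3 for _ in range(2)])
adaptor = train_adaptor()
final_loss = alignment_loss(adaptor)

# Inference: neural activity -> adaptor -> frozen decoder -> waveform.
neural, wave = pairs[0]
reconstruction = speech_decoder(adapt(adaptor, neural))
```

The point of the design is that only the small adaptor is fitted to the scarce paired data, while the decoder that produces the waveform is reused unchanged from the first phase.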
