NeuSpeech: Decode Neural signal as Speech
Yiqian Yang, Yiqun Duan, Qiang Zhang, Hyejeong Jo, Jinni Zhou, Won Hee Lee, Renjing Xu, Hui Xiong
TL;DR
NeuSpeech addresses non-invasive MEG-to-text decoding by translating raw MEG waves into open-vocabulary text using a cross-attention Whisper-based encoder-decoder. It introduces MEG-specific input adaptations and AdaLoRA-tuned encoder training while keeping the decoder fixed, enabling end-to-end learning without pretraining or teacher forcing. Across the GWilliams and Schoffelen datasets, NeuSpeech achieves competitive BLEU-1 scores (up to 60.30 on GWilliams and 53.16 on Schoffelen) and demonstrates generalization across languages and equipment, with comprehensive ablations on pretraining, joint training, scaling, and augmentation. The work highlights the feasibility of non-invasive brain-to-text interfaces and informs future directions for large-model pretraining, multi-layout fusion, and data-efficient learning, while noting limitations tied to data scarcity and signal variability.
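For readers unfamiliar with the metric quoted above, the following is a minimal sketch of sentence-level BLEU-1 (clipped unigram precision multiplied by a brevity penalty). The function name `bleu1` and the whitespace tokenization are assumptions made for illustration; the tokenization, corpus-level aggregation, and tooling behind the reported scores are not specified in this section, and reported scores are on a 0-100 scale.

```python
import math
from collections import Counter
from typing import List


def bleu1(hypothesis: List[str], reference: List[str]) -> float:
    """Sentence-level BLEU-1 in [0, 1] for a single reference (illustrative sketch)."""
    if not hypothesis:
        return 0.0
    hyp_counts = Counter(hypothesis)
    ref_counts = Counter(reference)
    # Clip each hypothesis unigram count by its count in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    precision = clipped / len(hypothesis)
    # Brevity penalty: penalize hypotheses shorter than the reference.
    c, r = len(hypothesis), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * precision


# Hypothetical example sentences, not taken from either dataset.
hyp = "the cat sat on mat".split()
ref = "the cat sat on the mat".split()
score = bleu1(hyp, ref)
print(f"BLEU-1 = {100 * score:.2f}")
```

In this example every hypothesis word appears in the reference, so the score is lowered only by the brevity penalty for the missing word.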
Abstract
Decoding language from brain dynamics is an important open direction in brain-computer interface (BCI) research, especially given the rapid growth of large language models. Compared with invasive signals, which require electrode implantation surgery, non-invasive neural signals (e.g., EEG, MEG) have attracted increasing attention because of their safety and generality. However, the exploration remains inadequate in three respects: 1) previous methods mainly focus on EEG, and none address this problem on MEG, which offers better signal quality; 2) prior works have predominantly used "teacher forcing" during generative decoding, which is impractical; 3) prior works are mostly "BART-based" rather than fully auto-regressive, even though fully auto-regressive models perform better in other sequence tasks. In this paper, we explore brain-to-text translation of MEG signals in a speech-decoding formulation. We are the first to investigate a cross-attention-based "Whisper" model for generating text directly from MEG signals without teacher forcing. Our model achieves BLEU-1 scores of 60.30 and 52.89 without pretraining or teacher forcing on two major datasets (GWilliams and Schoffelen). This paper conducts a comprehensive review of how the speech-decoding formulation performs on neural decoding tasks, covering pretraining initialization, training and evaluation set splitting, augmentation, and scaling law. Code is available at https://github.com/NeuSpeech/NeuSpeech1.
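The distinction between teacher-forced and free-running (auto-regressive) decoding can be illustrated with a toy sketch. Everything below is hypothetical: `TOY_TABLE` stands in for a trained decoder and is not the NeuSpeech model. Teacher-forced decoding conditions each step on the ground-truth prefix, so it needs the reference text at evaluation time and can hide errors that would otherwise propagate.

```python
from typing import Dict, List

BOS, EOS = "<s>", "</s>"

# Hypothetical toy "decoder": the next token depends only on the previous token.
TOY_TABLE: Dict[str, str] = {
    BOS: "the", "the": "cat", "cat": "sat", "sat": EOS, "dog": "ran", "ran": EOS,
}


def next_token(prefix: List[str]) -> str:
    """Return the toy model's prediction given the tokens decoded so far."""
    return TOY_TABLE.get(prefix[-1], EOS)


def decode_teacher_forced(reference: List[str]) -> List[str]:
    """Predict each token from the ground-truth prefix (requires the reference)."""
    inputs = [BOS] + reference[:-1]
    return [next_token(inputs[: t + 1]) for t in range(len(reference))]


def decode_free_running(max_len: int = 10) -> List[str]:
    """Predict each token from the model's own previous outputs (greedy decoding)."""
    prefix = [BOS]
    while len(prefix) - 1 < max_len:
        token = next_token(prefix)
        prefix.append(token)
        if token == EOS:
            break
    return prefix[1:]


reference = ["the", "dog", "ran", EOS]
teacher_forced = decode_teacher_forced(reference)
free_running = decode_free_running()
print("teacher-forced:", teacher_forced)
print("free-running:  ", free_running)
```

In this toy case the teacher-forced output recovers after its one mistake because the correct token is fed back in, whereas the free-running output carries the mistake forward. This is why scores obtained with teacher forcing at evaluation time can overstate real decoding performance.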
