
NeuSpeech: Decode Neural signal as Speech

Yiqian Yang, Yiqun Duan, Qiang Zhang, Hyejeong Jo, Jinni Zhou, Won Hee Lee, Renjing Xu, Hui Xiong

TL;DR

NeuSpeech addresses non-invasive MEG-to-text decoding by translating raw MEG waves into open-vocabulary text using a cross-attention Whisper-based encoder-decoder. It introduces MEG-specific input adaptations and AdaLora-tuned encoder training while keeping the decoder fixed, enabling end-to-end learning without pretraining or teacher forcing. Across the GWilliams and Schoffelen datasets, NeuSpeech achieves competitive BLEU-1 scores (up to 60.30 on GWilliams and 53.16 on Schoffelen) and demonstrates generalization across languages and equipment, with comprehensive ablations on pretraining, joint training, scaling, and augmentation. The work highlights the feasibility of non-invasive brain-to-text interfaces and informs future directions for large-model pretraining, multi-layout fusion, and data-efficient learning, while noting limitations tied to data scarcity and signal variability.

Abstract

Decoding language from brain dynamics is an important open direction in the realm of brain-computer interfaces (BCI), especially considering the rapid growth of large language models. Compared to invasive signals, which require electrode implantation surgery, non-invasive neural signals (e.g. EEG, MEG) have attracted increasing attention because of their safety and generality. However, the exploration is inadequate in three respects: 1) previous methods mainly focus on EEG, and none of them address this problem on MEG, which has better signal quality; 2) prior works have predominantly used "teacher forcing" during generative decoding, which is impractical; 3) prior works are mostly "BART-based" rather than fully auto-regressive, even though fully auto-regressive models perform better in other sequence tasks. In this paper, we explore the brain-to-text translation of MEG signals in a speech-decoding formulation. We are the first to investigate a cross-attention-based "whisper" model for generating text directly from MEG signals without teacher forcing. Our model achieves impressive BLEU-1 scores of 60.30 and 52.89 without pretraining and teacher forcing on two major datasets (GWilliams and Schoffelen). This paper conducts a comprehensive review to understand how the speech-decoding formulation performs on neural decoding tasks, including pretraining initialization, training and evaluation set splitting, augmentation, and scaling law. Code is available at https://github.com/NeuSpeech/NeuSpeech1.
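The results above are quoted as BLEU-1 scores. As a rough illustration of what that metric measures, the following is a minimal sketch of sentence-level BLEU-1 (clipped unigram precision times a brevity penalty). The function name `bleu1`, the whitespace tokenization, the lowercasing, and the 0 to 100 scaling are assumptions made for this example; the page does not say which tokenizer or BLEU implementation the authors used, so this sketch would not necessarily reproduce their numbers.

```python
import math
from collections import Counter


def bleu1(reference: str, hypothesis: str) -> float:
    """Sentence-level BLEU-1 on a 0-100 scale (illustrative sketch only)."""
    # Assumption: simple lowercased whitespace tokenization.
    ref_tokens = reference.lower().split()
    hyp_tokens = hypothesis.lower().split()
    if not hyp_tokens:
        return 0.0

    ref_counts = Counter(ref_tokens)
    hyp_counts = Counter(hyp_tokens)

    # Clip each hypothesis word count by how often it occurs in the reference,
    # so repeating a correct word cannot inflate the score.
    overlap = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    precision = overlap / len(hyp_tokens)

    # Brevity penalty: hypotheses shorter than the reference are penalized.
    if len(hyp_tokens) >= len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(ref_tokens) / len(hyp_tokens))

    return 100.0 * precision * brevity_penalty


if __name__ == "__main__":
    print(bleu1("the cat sat on the mat", "the cat sat on a mat"))
```

Because BLEU-1 only counts matching words, a high score does not by itself guarantee correct word order or fluent sentences.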

Paper Structure (28 sections, 4 figures, 8 tables)

This paper contains 28 sections, 4 figures, 8 tables.

Figures (4)

  • Figure 1: NeuSpeech overview. The MEG signal is recorded while the subject is listening to speech. Our model is trained end-to-end; only the AdaLora modules applied to the encoder and convolution layers are trained, using a cross-entropy loss. We evaluate both with and without teacher forcing. At test time, teacher forcing means predicting the next token from the previous ground-truth tokens rather than from model-generated tokens.
  • Figure 2: Performance for different model sizes. Black numbers are BLEU-1 scores; green numbers are the effective epochs, i.e. the epoch after which the evaluation loss no longer decreases. Note that each experiment runs for 120 epochs.
  • Figure 3: Performance changes with different data augmentations. The probability is the likelihood that an augmentation is applied to each data segment.
  • Figure 4: Fine-tuned layers and training data ratio. The blue line shows different fine-tuned layers; the green line shows different training data ratios. Note that the whisper-base model has only 6 layers.
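
The Figure 1 caption distinguishes evaluation with and without teacher forcing. The sketch below illustrates that difference with a toy next-token predictor. Everything in it is hypothetical: the bigram table `TOY_BIGRAMS`, the helper names, and the `<s>` / `</s>` markers are invented for illustration and are not part of NeuSpeech. A real decoder would also condition on the MEG encoder output, which this toy ignores.

```python
from typing import Callable, List

Token = str
NextTokenFn = Callable[[List[Token]], Token]

# Hypothetical toy "model": predicts the next token from the last token only.
TOY_BIGRAMS = {
    "<s>": "the", "the": "dog", "cat": "sat", "dog": "ran",
    "sat": "down", "ran": "away", "down": "</s>", "away": "</s>",
}


def toy_next_token(prefix: List[Token]) -> Token:
    """Return the toy model's prediction given the tokens so far."""
    return TOY_BIGRAMS.get(prefix[-1], "</s>")


def decode_teacher_forced(next_token: NextTokenFn, target: List[Token],
                          bos: Token = "<s>") -> List[Token]:
    """Predict every position from the ground-truth history (teacher forcing)."""
    predictions = []
    for i in range(len(target)):
        prefix = [bos] + target[:i]  # history comes from the reference text
        predictions.append(next_token(prefix))
    return predictions


def decode_free_running(next_token: NextTokenFn, max_len: int,
                        bos: Token = "<s>", eos: Token = "</s>") -> List[Token]:
    """Predict every position from the model's own previous outputs."""
    prefix = [bos]
    for _ in range(max_len):
        token = next_token(prefix)
        if token == eos:
            break
        prefix.append(token)  # history comes from the model itself
    return prefix[1:]


if __name__ == "__main__":
    target = ["the", "cat", "sat", "down"]
    print(decode_teacher_forced(toy_next_token, target))
    print(decode_free_running(toy_next_token, max_len=10))
```

In this toy, the teacher-forced pass makes one mistake ("dog" instead of "cat") and then recovers, because the next step is fed the correct word. The free-running pass makes the same mistake and never recovers. This is why scores obtained with teacher forcing can overstate what a model produces when no ground-truth text is available at inference time.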