Decoding inner speech with an end-to-end brain-to-text neural interface

Yizi Zhang, Linyang He, Chaofei Fan, Tingkai Liu, Han Yu, Trung Le, Jingyuan Li, Scott Linderman, Lea Duncker, Francis R Willett, Nima Mesgarani, Liam Paninski

TL;DR

BIT introduces an end-to-end brain-to-text framework that translates neural activity directly into sentences by integrating a cross-species transformer encoder with an audio-LLM decoder and a cross-modal alignment objective. The approach achieves state-of-the-art results on Brain-to-Text benchmarks and substantially narrows the gap between end-to-end and cascaded decoding, especially when using small audio LLMs and SSL pretraining. Key findings include strong cross-task transfer between attempted and imagined speech, and interpretability evidence that neural embeddings preserve semantic structure aligned to language models. This work advances end-to-end neural decoding, enabling differentiable optimization across perception and language generation with practical implications for communication aids.

Abstract

Speech brain-computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization of all stages simultaneously. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder, whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state-of-the-art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
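The abstract reports results as word error rate (WER). As a minimal sketch of how that metric is usually defined (word-level edit distance divided by the number of reference words), the helper below may be useful. The function name `word_error_rate` and the example sentences are illustrative assumptions, not the benchmark's official scoring code, which may normalize text or aggregate across sentences differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper only; not the benchmark's official scorer.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Relative WER reduction implied by the two figures quoted in the abstract.
prior_wer, bit_wer = 24.69, 10.22
relative_reduction = (prior_wer - bit_wer) / prior_wer

if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(f"relative reduction: {relative_reduction:.1%}")
```

Under this definition, the drop from 24.69% to 10.22% corresponds to a relative reduction of a little under 59%.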

Paper Structure

This paper contains 81 sections, 3 equations, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Schematic illustration of BIT. (A) BIT is an end-to-end speech decoding framework that translates neural activity directly into text by combining a cross-task, cross-species pretrained neural encoder with an audio-LLM decoder. The data are obtained and preprocessed separately for each study (see the data details appendix). (B) The neural encoder is a transformer that embeds 20 ms bins of thresholded spikes and spike-band power into multi-bin time patches. It is pretrained using SSL with time-patch masking, reconstructing patch tokens via subject-specific linear read-in and read-out layers with an MSE loss. After pretraining, the masking module is removed, and the encoder is fine-tuned for phoneme decoding using a linear classifier with CTC loss. (C) The neural encoder outputs are mapped to the text embedding space of an audio-LLM via a shallow MLP projector. A modality aligner trained with contrastive learning projects mean-pooled neural and text embeddings into a shared latent space for modality alignment. To guide decoding, we insert a prompt between neural and text embedding tokens: "decode the above neural activity into an English sentence:". During fine-tuning, we update the neural encoder and the projector and apply LoRA to the linear layers in the audio-LLM’s attention and feed-forward blocks, while keeping other parameters frozen.
  • Figure 2: Benchmarking BIT versus baselines in attempted and imagined speech decoding. (A) For attempted speech, the pretrained encoder (BIT-Human, BIT-All) outperforms RNN and BIT-TFS using both cascaded and end-to-end approaches. Bar plots show mean WER across competition holdout sentences. (B) For imagined speech (50-word vocabulary), BIT-All outperforms all other baselines in both cascaded and end-to-end settings. Bar plots show mean WER across partitioned test sentences. SSL pretraining provides greater benefits for imagined speech than for attempted speech, since imagined speech has far fewer labeled examples. (C) Scatterplots compare BIT-All vs. BIT-Cross-Task-Only on imagined speech decoding, with each dot representing a test sentence and the green value showing relative improvement. Results show that SSL pretraining (cross-subject, unlabeled) yields larger transfer gains than SL pretraining (within-subject, cross-task) after fine-tuning. (D) Example decoded sentences from BIT-All using the end-to-end approach. The imagined speech task has a smaller vocabulary (50 words) than attempted speech.
  • Figure 3: LLM decoder ablation across modality, model size, prompt design, and contrastive learning usage. (A) For audio-LLMs, neural activity can be treated as either a neural or an audio modality. For neural modality, encoder outputs are projected directly into the text embedding space via an MLP projector. For audio modality, neural encoder outputs pass through the MLP projector followed by a multimodal projector used for the audio encoder. (B) Different prompts are used depending on whether neural encoder outputs are treated as neural or audio modality, with the encoder outputs inserted at the placeholder “#”. (C-D) Bar plots show the mean WER across validation sentences for different LLMs, modality treatments, prompt designs, and contrastive learning usage. Here, we report validation WER, as it is used to select the final LLM decoder for benchmark submission. Colors distinguish text- (yellow) and audio-LLMs (blue), with transparency indicating whether neural activity is treated as Neural or Audio modality, while diagonal hatching denotes that contrastive learning is not used.
  • Figure 4: BIT aligns attempted and imagined speech neural embeddings to enable cross-task generalization. (A) Representational similarity analysis (RSA) scores between neural and audio-LLM text embeddings. (B) PCA embeddings of neural features from participant T12 are visualized on the first two PCs. Word-level embeddings are averaged across time and trials and shown as dots. The same words are shown for both tasks. Colors indicate tasks, and color intensity represents the distance between attempted and imagined speech embeddings for each word (darker colors indicate higher similarity). The line shows the LDA linear discriminant. (C) For BIT-All, PCA is applied to neural encoder outputs from participant T12, with word-level embeddings from the top two PCs visualized as dots. Same plotting conventions as panel B. (D) Using a cross-attention projector in BIT allows us to visualize attention weights, which reveal that neural-text temporal alignment is similar across tasks.
  • Figure 5: Distribution of neural token lengths across sentences for RSA. We restrict RSA to sentences with token lengths between 45 and 80 (mean length $\approx$ 63) for participant T12 and between 120 and 200 (mean length $\approx$ 160) for participant T15, since neural embeddings are converted into fixed-length sentence vectors by dividing each sequence into ten temporal segments and concatenating their averages. Sequences that are too short lack sufficient resolution, while overly long sequences introduce imbalance. These ranges ensure reliable and comparable sentence-level embeddings for cross-modal RSA.
  • ...and 3 more figures
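
The Figure 5 caption describes turning a variable-length sequence of neural tokens into a fixed-length sentence vector by dividing the sequence into ten temporal segments and concatenating their averages. A minimal sketch of that pooling step follows. The function names, the use of near-equal integer boundaries for the segments, and the toy input are assumptions made for illustration; the caption does not specify how segment boundaries are chosen.

```python
from typing import List, Sequence


def in_length_range(n_tokens: int, low: int, high: int) -> bool:
    """Keep only sentences whose token count lies in [low, high],
    e.g. 45-80 for participant T12 or 120-200 for participant T15."""
    return low <= n_tokens <= high


def sentence_vector(tokens: Sequence[Sequence[float]], n_segments: int = 10) -> List[float]:
    """Average the tokens within each temporal segment, then concatenate.

    tokens: T embedding vectors, each of dimension D.
    Returns a flat list of length n_segments * D.
    Segment boundaries are near-equal integer splits (an assumption).
    """
    n_tokens = len(tokens)
    if n_tokens < n_segments:
        raise ValueError("need at least one token per segment")
    dim = len(tokens[0])

    pooled: List[float] = []
    for s in range(n_segments):
        start = s * n_tokens // n_segments
        end = (s + 1) * n_tokens // n_segments
        segment = tokens[start:end]
        # Per-dimension mean over the tokens in this segment.
        for d in range(dim):
            pooled.append(sum(row[d] for row in segment) / len(segment))
    return pooled


if __name__ == "__main__":
    # Toy sequence: 63 tokens (the reported mean length for T12), 4 dimensions.
    toy = [[float(t), 1.0, 0.0, -1.0] for t in range(63)]
    if in_length_range(len(toy), 45, 80):
        vec = sentence_vector(toy)
        print(len(vec))  # 10 segments * 4 dimensions
```

Because every sentence is reduced to the same number of segments, the resulting vectors have equal length regardless of sentence duration, which is what makes them comparable in a sentence-level RSA.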