Time-Masked Transformers with Lightweight Test-Time Adaptation for Neural Speech Decoding

Ebrahim Feghhi, Shreyas Kaasyap, Nima Hadidi, Jonathan C. Kao

TL;DR

This paper tackles real-time neural speech decoding for speech neuroprostheses using intracranial microelectrode array (MEA) recordings. It introduces three contributions: heavy time-masking during training, a compact unidirectional Transformer that replaces the GRU baseline, and DietCORP, a lightweight test-time adaptation method that uses multiple time-masked augmentations of a single trial. The time-masked Transformer lowers word error rate (WER) by roughly 20–26% in relative terms while reducing model size, memory, FLOPs, and calibration time, which makes on-device deployment more feasible. DietCORP further stabilizes performance across held-out days at minimal per-trial cost, enabling robust real-time adaptation. Together, these advances move neural speech decoding toward accurate, efficient, and portable brain-computer interface systems for people with paralysis.

Abstract

Speech neuroprostheses aim to restore communication for people with severe paralysis by decoding speech directly from neural activity. To accelerate algorithmic progress, a recent benchmark released intracranial recordings from a paralyzed participant attempting to speak, along with a baseline decoding algorithm. Prior work on the benchmark showed impressive accuracy gains. However, these gains increased computational costs and were not demonstrated in a real-time decoding setting. Here, we make three contributions that pave the way towards accurate, efficient, and real-time neural speech decoding. First, we incorporate large amounts of time-masking during training. On average, over $50\%$ of each trial is masked. Second, we replace the gated recurrent unit (GRU) architecture used in the baseline algorithm with a compact Transformer. The Transformer architecture uses $83\%$ fewer parameters, cuts peak GPU memory usage by $52\%$, and is significantly faster to calibrate relative to the GRU. Third, we design a lightweight variant of an existing test-time adaptation method developed for decoding handwriting from neural activity. Our variant adapts the model using multiple time-masked augmentations of a single trial and requires only one gradient step per trial. Together, these contributions reduce word error rate by over $20\%$ and effectively mitigate performance degradations across held-out days in a real-time decoding setting while substantially lowering computational costs.
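The time-masking idea described in the abstract can be illustrated with a minimal sketch. The exact masking scheme is not given here, so the function names and the parameters `num_spans` and `max_span_fraction` below are hypothetical, and the default values are not tuned to reproduce the reported average of over $50\%$ masked per trial. The sketch only shows the general technique: pick a few contiguous spans of time steps and replace them with a MASK token.

```python
import random


def sample_time_mask(num_patches, num_spans=4, max_span_fraction=0.3, rng=None):
    """Return a list of booleans; True marks a time step to be masked.

    Assumption (not from the paper): each of `num_spans` spans has a length
    drawn uniformly from 0..max_span_fraction * num_patches and a uniformly
    drawn start position. Spans may overlap.
    """
    if num_patches == 0:
        return []
    rng = rng or random.Random()
    mask = [False] * num_patches
    max_span = max(1, int(max_span_fraction * num_patches))
    for _ in range(num_spans):
        span = rng.randint(0, max_span)  # randint is inclusive on both ends
        if span == 0:
            continue
        start = rng.randint(0, num_patches - span)
        for i in range(start, start + span):
            mask[i] = True
    return mask


def apply_mask(patches, mask, mask_token="MASK"):
    """Replace every masked patch with a placeholder token."""
    return [mask_token if m else p for p, m in zip(patches, mask)]


# Usage: mask a toy trial of 20 patches and report the masked fraction.
rng = random.Random(0)
patches = [f"x{i}" for i in range(20)]
mask = sample_time_mask(len(patches), rng=rng)
masked_trial = apply_mask(patches, mask)
print(f"masked fraction: {sum(mask) / len(mask):.2f}")
```

In a real model the placeholder would be a learned MASK embedding rather than a string, and the same sampler could be called several times on one trial to produce multiple augmentations.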

Paper Structure

This paper contains 27 sections, 8 equations, 2 figures, 8 tables, 1 algorithm.

Figures (2)

  • Figure 1: A. The GRU exhibits pronounced overfitting when training for long durations. Black dashed line indicates where training was stopped for the baseline model. B. Adjacent input windows to the GRU overlap by 87.5% when using the optimal baseline hyperparameters (window length = $640$ ms, stride = $80$ ms). C. We replaced the GRU with a lightweight Transformer-based model. The Transformer takes as input non-overlapping temporal patches of neural activity and outputs logits (denoted as $p_i$). Consecutive patches were replaced with a MASK token during training, as denoted by dark coloring. We used the connectionist temporal classification (CTC) loss. D. An overview of DietCORP. In the top panel, the Transformer architecture is run in evaluation mode to generate logits, and these logits are integrated with a language-model-guided beam search to generate a pseudo-label. In the bottom panel, the model is trained to produce the pseudo-label across $Z$ time-masked augmentations with CTC loss. Only the patch embedding module is adapted during this process.
  • Figure 2: Results are for the time-masked Transformer with a 3-gram LM. Points show the mean over four seeds; shading indicates $\pm$SEM. A. WER across five held-out days without adaptation and with DietCORP. B. Same as panel A with evaluation on eight held-out days. C. Average WER across held-out days as a function of the number of augmentations used by DietCORP. Green points show results with DietCORP; the purple dashed line shows results with no adaptation. Lower curves correspond to five held-out days, upper curves to eight held-out days.
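
The 87.5% overlap quoted in Figure 1B follows directly from the window length and stride: adjacent windows share (window − stride) / window of their samples. A short check, using a hypothetical helper name, is sketched below.

```python
def window_overlap_fraction(window_ms, stride_ms):
    """Fraction of a window shared with the next window in a sliding scheme."""
    if stride_ms >= window_ms:
        return 0.0  # windows that do not overlap share nothing
    return (window_ms - stride_ms) / window_ms


# Baseline GRU hyperparameters from Figure 1B: 640 ms window, 80 ms stride.
print(window_overlap_fraction(640, 80))   # 0.875
# Non-overlapping patches (stride equal to patch length) share nothing.
print(window_overlap_fraction(640, 640))  # 0.0
```

The second call corresponds to the non-overlapping patching described in Figure 1C; the 640 ms patch length there is only an illustrative value, not one taken from the paper.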