Time-Masked Transformers with Lightweight Test-Time Adaptation for Neural Speech Decoding

Ebrahim Feghhi; Shreyas Kaasyap; Nima Hadidi; Jonathan C. Kao

TL;DR

This paper tackles real-time neural speech decoding for speech neuroprostheses using intracranial microelectrode array (MEA) recordings. It introduces three contributions: heavy time-masking during training, a compact unidirectional Transformer that replaces the GRU baseline, and DietCORP, a lightweight test-time adaptation method that uses multiple time-masked augmentations of a single trial. The time-masked Transformer reduces word error rate (WER) by over 20% in relative terms (20–26%) while using a smaller model with lower memory use, fewer FLOPs, and shorter calibration time, which makes on-device deployment more feasible. DietCORP further stabilizes performance across held-out days at minimal per-trial cost, enabling robust real-time adaptation. Together, these advances move neural speech decoding toward accurate, efficient, and portable brain-computer interface systems for people with paralysis.

Abstract

Speech neuroprostheses aim to restore communication for people with severe paralysis by decoding speech directly from neural activity. To accelerate algorithmic progress, a recent benchmark released intracranial recordings from a paralyzed participant attempting to speak, along with a baseline decoding algorithm. Prior work on the benchmark showed impressive accuracy gains. However, these gains increased computational costs and were not demonstrated in a real-time decoding setting. Here, we make three contributions that pave the way towards accurate, efficient, and real-time neural speech decoding. First, we incorporate large amounts of time-masking during training. On average, over $50\%$ of each trial is masked. Second, we replace the gated recurrent unit (GRU) architecture used in the baseline algorithm with a compact Transformer. The Transformer architecture uses $83\%$ fewer parameters, cuts peak GPU memory usage by $52\%$, and is significantly faster to calibrate relative to the GRU. Third, we design a lightweight variant of an existing test-time adaptation method developed for decoding handwriting from neural activity. Our variant adapts the model using multiple time-masked augmentations of a single trial and requires only one gradient step per trial. Together, these contributions reduce word error rate by over $20\%$ and effectively mitigate performance degradations across held-out days in a real-time decoding setting while substantially lowering computational costs.
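
To make the time-masking idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a trial is stored as a NumPy array of shape (time steps, channels) and that masking means zeroing randomly placed contiguous spans of time steps. The function names `time_mask` and `masked_augmentations`, and the parameters `num_masks` and `max_mask_frac` with their default values, are hypothetical and are not taken from the paper. The second function illustrates how several masked copies of one trial could be produced, as the test-time adaptation variant requires; the adaptation step itself is not shown.

```python
import numpy as np


def time_mask(trial, num_masks=20, max_mask_frac=0.075, rng=None):
    """Zero out randomly placed contiguous spans of time steps.

    trial: array of shape (T, C), time steps by channels.
    num_masks, max_mask_frac: hypothetical settings, not from the paper.
    Returns the masked copy and a boolean vector marking kept time steps.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_steps = trial.shape[0]
    max_len = max(1, int(max_mask_frac * num_steps))
    keep = np.ones(num_steps, dtype=bool)
    for _ in range(num_masks):
        # Span length in [0, max_len]; start chosen so the span fits.
        length = int(rng.integers(0, max_len + 1))
        start = int(rng.integers(0, num_steps - length + 1))
        keep[start:start + length] = False
    masked = trial.copy()
    masked[~keep] = 0.0  # masked time steps are zeroed on every channel
    return masked, keep


def masked_augmentations(trial, num_augs=4, seed=0, **mask_kwargs):
    """Stack several independently masked copies of a single trial."""
    rng = np.random.default_rng(seed)
    copies = [time_mask(trial, rng=rng, **mask_kwargs)[0] for _ in range(num_augs)]
    return np.stack(copies)  # shape (num_augs, T, C)
```

Because spans may overlap, the fraction of time steps actually masked varies from trial to trial; in this sketch it can be measured per trial as `1 - keep.mean()`.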
