ALIGN: Adversarial Learning for Generalizable Speech Neuroprosthesis

Zhanqi Zhang, Shun Li, Bernardo L. Sabatini, Mikio Aoi, Gal Mishne

Abstract

Intracortical brain-computer interfaces (BCIs) can decode speech from neural activity with high accuracy when trained on data pooled across recording sessions. In realistic deployment, however, models must generalize to new sessions without labeled data, and performance often degrades due to cross-session nonstationarities (e.g., electrode shifts, neural turnover, and changes in user strategy). In this paper, we propose ALIGN, a session-invariant learning framework based on multi-domain adversarial neural networks for semi-supervised cross-session adaptation. ALIGN trains a feature encoder jointly with a phoneme classifier and a domain classifier operating on the latent representation. Through adversarial optimization, the encoder is encouraged to preserve task-relevant information while suppressing session-specific cues. We evaluate ALIGN on intracortical speech decoding and find that it generalizes consistently better to previously unseen sessions, improving both phoneme error rate and word error rate relative to baselines. These results indicate that adversarial domain alignment is an effective approach for mitigating session-level distribution shift and enabling robust longitudinal BCI decoding.
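
The abstract reports results as phoneme error rate (PER) and word error rate (WER). Both are conventionally computed as the edit distance between the decoded and reference sequences, divided by the total reference length. The sketch below assumes that conventional definition; the paper's exact scoring script is not shown in this excerpt, and the names `edit_distance` and `error_rate` are illustrative, not taken from the authors' code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps):
    """Total edits over total reference length, pooled across sentences.

    Gives PER when the tokens are phonemes and WER when they are words.
    """
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_len = sum(len(r) for r in refs)
    if total_len == 0:
        raise ValueError("reference sequences are empty")
    return total_edits / total_len


# Hypothetical example: one substituted phoneme out of four.
per = error_rate([["HH", "AH", "L", "OW"]], [["HH", "EH", "L", "OW"]])  # 0.25

# Hypothetical example: one inserted word against a three-word reference.
wer = error_rate(["the cat sat".split()], ["the cat sat down".split()])
```

Pooling edits and lengths across sentences, rather than averaging per-sentence rates, keeps short sentences from dominating the score; whether the paper pools this way is an assumption.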

Paper Structure (23 sections, 9 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Dataset and domain shift across sessions. a. Attempted speech dataset. Intracortical neural activity was recorded from a participant with amyotrophic lateral sclerosis (ALS) while they attempted to speak sentences presented on a screen. Although vocalizations were produced, the speech was not intelligible. b. Example neural activity corresponding to the same attempted sentence recorded in two different sessions, illustrating session-dependent variability in neural dynamics. c. Illustration of the distribution shift between source and target sessions caused by nonstationarities in neural dynamics, and of the session-invariant distributions obtained after adaptation. d. t-SNE visualization of the latent embeddings used for phoneme decoding (input to the phoneme classifier), showing source sessions (blue) and target sessions (red) before and after ALIGN.
  • Figure 2: ALIGN model architecture. ALIGN consists of three major modules. The feature encoder $f$ and phoneme classifier $p$ form the backbone, instantiated here with a Transformer-based decoder (ucla). We add a domain classifier $d$ that takes the transformer encoder embedding as input; its multi-head classifiers are trained to predict whether the input comes from source or target sessions. A gradient reversal layer is used so that the encoder learns session-invariant features.
  • Figure 3: Embedding visualization before and after ALIGN. (a) PCA projections of the intermediate latent embeddings produced by the feature encoder in the transformer baseline decoder for T12 (top) and in ALIGN (bottom). Twelve source sessions are shown in color, along with three target sessions shown in gray. (b) Corresponding projections of the final latent embeddings in both models.
  • Figure 4: ALIGN with and without TTA. Test WER of the GRU baseline (green) without TTA, and of the transformer baseline (blue) and the ALIGN model (orange) with and without TTA (dark and light shades), where TTA is trained from the first target session (left) or from the first test session (right). All models are tested on three different train-test partitions over 5 seeds.
  • Figure 5: Per-day phoneme error rate (PER). Mean and standard deviation of PER for the transformer baseline (blue) and ALIGN (orange) on the T12 12--8--3 dataset. The first eight sessions correspond to validation of target sessions, and the last three correspond to held-out test sessions.
  • ...and 4 more figures
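
The gradient reversal layer named in the Figure 2 caption can be illustrated with a deliberately tiny scalar model. This is a sketch under strong simplifying assumptions, not the authors' implementation: the encoder is a single weight `w`, the domain classifier is a single-weight binary logistic unit `v` (ALIGN uses a Transformer encoder and multi-head domain classifiers), and the reversal strength `lam` is a hypothetical hyperparameter. The layer is the identity in the forward pass and multiplies the incoming gradient by `-lam` in the backward pass, so the domain classifier descends its loss while the encoder is pushed to increase it.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def domain_loss_grads(w, v, x, y, lam):
    """One-sample pass: encoder -> gradient reversal -> domain classifier.

    w   : encoder weight (feature z = w * x)
    v   : domain-classifier weight (logit = v * z)
    x   : scalar input
    y   : domain label, 0 (source) or 1 (target)
    lam : reversal strength (illustrative hyperparameter)
    Returns (loss, grad_v, grad_w).
    """
    # Forward pass: the reversal layer leaves the feature unchanged.
    z = w * x
    z_rev = z
    p = sigmoid(v * z_rev)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))

    # Backward pass through the domain classifier (ordinary gradients).
    dlogit = p - y              # d(loss)/d(logit) for sigmoid + cross-entropy
    grad_v = dlogit * z_rev     # classifier sees the true gradient
    grad_z_rev = dlogit * v

    # Gradient reversal: flip the sign and scale before reaching the encoder.
    grad_z = -lam * grad_z_rev
    grad_w = grad_z * x
    return loss, grad_v, grad_w


loss, grad_v, grad_w = domain_loss_grads(w=0.5, v=-1.2, x=2.0, y=1, lam=0.7)
```

A gradient-descent step on `v` with `grad_v` improves the domain classifier, while the same kind of step on `w` with `grad_w` moves the encoder toward features the classifier cannot separate. In a full model the encoder would also receive the unreversed gradient of the phoneme classification loss.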