Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals

Hui Zheng; Hai-Teng Wang; Wei-Bang Jiang; Zhong-Tao Chen; Li He; Pei-Yang Lin; Peng-Hu Wei; Guo-Guang Zhao; Yun-Zhe Liu

TL;DR

The Du-IN model extracts contextual embeddings from region-level tokens through discrete codex-guided mask modeling. It is suitable for invasive brain modeling and represents a promising neuro-inspired AI approach in brain-computer interfaces.

Abstract

Invasive brain-computer interfaces with Electrocorticography (ECoG) have shown promise for high-performance speech decoding in medical applications, but less damaging methods like intracranial stereo-electroencephalography (sEEG) remain underexplored. With rapid advances in representation learning, leveraging abundant recordings to enhance speech decoding is increasingly attractive. However, popular methods often pre-train temporal models based on brain-level tokens, overlooking that brain activities in different regions are highly desynchronized during tasks. Alternatively, they pre-train spatial-temporal models based on channel-level tokens but fail to evaluate them on challenging tasks like speech decoding, which requires intricate processing in specific language-related areas. To address this issue, we collected a well-annotated Chinese word-reading sEEG dataset targeting language-related brain networks from 12 subjects. Using this benchmark, we developed the Du-IN model, which extracts contextual embeddings based on region-level tokens through discrete codex-guided mask modeling. Our model achieves state-of-the-art performance on the 61-word classification task, surpassing all baselines. Model comparisons and ablation studies reveal that our design choices, including (i) temporal modeling based on region-level tokens by utilizing 1D depthwise convolution to fuse channels in the ventral sensorimotor cortex (vSMC) and superior temporal gyrus (STG) and (ii) self-supervision through discrete codex-guided mask modeling, significantly contribute to this performance. Overall, our approach -- inspired by neuroscience findings and capitalizing on region-level representations from specific brain regions -- is suitable for invasive brain modeling and represents a promising neuro-inspired AI approach in brain-computer interfaces.
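The two ideas named in the abstract, discretizing patch embeddings against a codex and predicting the tokens of masked patches, can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the codex values, the embedding shapes, the 50% mask ratio, and the helper names `quantize` and `make_mask` are all assumptions chosen for illustration only.

```python
import numpy as np

def quantize(embeddings, codex):
    """Return, for each patch embedding, the index of its nearest codex entry."""
    # Squared Euclidean distance between every embedding and every codex entry.
    dists = ((embeddings[:, None, :] - codex[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def make_mask(n_patches, mask_ratio, rng):
    """Return a boolean mask that hides a fixed fraction of patches at random."""
    n_masked = int(round(n_patches * mask_ratio))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, size=n_masked, replace=False)] = True
    return mask

rng = np.random.default_rng(0)

# Hypothetical codex: 8 entries of dimension 4, entry k is [k, k, k, k].
codex = np.arange(8, dtype=float)[:, None] * np.ones((1, 4))

# Hypothetical patch embeddings: noisy copies of known codex entries.
embeddings = codex[[3, 3, 5, 0, 7, 1]] + 0.01 * rng.normal(size=(6, 4))

tokens = quantize(embeddings, codex)                   # discrete neural tokens
mask = make_mask(len(tokens), mask_ratio=0.5, rng=rng)
targets = tokens[mask]                                 # tokens a masked model would predict
```

In this toy setup the masked-modeling objective would be a classification loss over `targets`, computed from the visible (unmasked) patches; the actual encoder, codex size, and masking strategy are described in the paper's Method section.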

Paper Structure (57 sections, 12 equations, 11 figures, 19 tables)

Introduction
Related Works
Neural Basis of Language Function
Language Decoding in BCI
Self-supervised Learning in BCI
Method
Task Definition
Model Architecture
Spatial Encoder.
Temporal Embedding.
Transformer Encoder.
Du-IN VQ-VAE Training
Du-IN Encoder.
Du-IN Regressor.
Pre-training Du-IN
...and 42 more sections

Figures (11)

Figure 1: Overall illustration of sEEG decoding setup and comparison with SOTA baselines.
Figure 2: The overall architecture of the Du-IN Encoder. The Du-IN Encoder is used as the encoder in all Du-IN models (i.e., Du-IN VQ-VAE, Du-IN MAE, and Du-IN CLS (classification)); see the appendix on model details for more information.
Figure 3: Overview of Du-IN VQ-VAE training and Du-IN MAE training. (a) We train the Du-IN Encoder in the Du-IN VQ-VAE to discretize sEEG signals into discrete neural tokens by reconstructing the original sEEG signals. (b) During Du-IN MAE training, a portion of the sEEG patches is masked, and the objective is to predict the masked tokens from the visible patches.
Figure 4: The channel contribution analysis. (a) The channel contribution map. (b) The effect of the number of channels (sorted by channel contribution score) on decoding performance.
Figure 5: Ablation study on different codex sizes, codex dimensions, and receptive fields.
...and 6 more figures
