Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals

Hui Zheng, Hai-Teng Wang, Wei-Bang Jiang, Zhong-Tao Chen, Li He, Pei-Yang Lin, Peng-Hu Wei, Guo-Guang Zhao, Yun-Zhe Liu

TL;DR

The Du-IN model, which extracts contextual embeddings based on region-level tokens through discrete codex-guided mask modeling, is suitable for invasive brain modeling and represents a promising neuro-inspired AI approach in brain-computer interfaces.

Abstract

Invasive brain-computer interfaces with Electrocorticography (ECoG) have shown promise for high-performance speech decoding in medical applications, but less damaging methods like intracranial stereo-electroencephalography (sEEG) remain underexplored. With rapid advances in representation learning, leveraging abundant recordings to enhance speech decoding is increasingly attractive. However, popular methods often pre-train temporal models based on brain-level tokens, overlooking that brain activities in different regions are highly desynchronized during tasks. Alternatively, they pre-train spatial-temporal models based on channel-level tokens but fail to evaluate them on challenging tasks like speech decoding, which requires intricate processing in specific language-related areas. To address this issue, we collected a well-annotated Chinese word-reading sEEG dataset targeting language-related brain networks from 12 subjects. Using this benchmark, we developed the Du-IN model, which extracts contextual embeddings based on region-level tokens through discrete codex-guided mask modeling. Our model achieves state-of-the-art performance on the 61-word classification task, surpassing all baselines. Model comparisons and ablation studies reveal that our design choices, including (i) temporal modeling based on region-level tokens by utilizing 1D depthwise convolution to fuse channels in the ventral sensorimotor cortex (vSMC) and superior temporal gyrus (STG) and (ii) self-supervision through discrete codex-guided mask modeling, significantly contribute to this performance. Overall, our approach -- inspired by neuroscience findings and capitalizing on region-level representations from specific brain regions -- is suitable for invasive brain modeling and represents a promising neuro-inspired AI approach in brain-computer interfaces.
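
To make the three token granularities contrasted in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the helper `make_tokens`, the toy recording, and the channel-to-region map are all hypothetical, and channels are fused with a plain average as a stand-in for the learned 1D depthwise convolution the abstract mentions. Treating region-level tokens as a single sequence fused from the vSMC and STG channels is one reading of the abstract.

```python
from statistics import fmean


def make_tokens(signal, groups, patch_len):
    """Cut a [channels][samples] recording into token sequences.

    Each token is one time patch of one channel group. Channels inside a
    group are fused by a plain average (assumption: a stand-in for a learned
    fusion layer). Returns one token sequence per group.
    """
    n_samples = len(signal[0])
    n_patches = n_samples // patch_len  # drop any incomplete trailing patch
    sequences = []
    for group in groups:
        seq = []
        for p in range(n_patches):
            start = p * patch_len
            token = [
                fmean([signal[c][t] for c in group])
                for t in range(start, start + patch_len)
            ]
            seq.append(token)
        sequences.append(seq)
    return sequences


# Toy recording: 6 channels, 8 samples each (made-up values).
signal = [[float(c * 10 + t) for t in range(8)] for c in range(6)]

# Hypothetical channel-to-region map; channels 4 and 5 lie outside
# the language-related regions.
regions = {"vSMC": [0, 1], "STG": [2, 3], "other": [4, 5]}

# Brain-level: every channel fused into one sequence over time.
brain_level = make_tokens(signal, [list(range(6))], patch_len=4)

# Channel-level: one sequence per channel.
channel_level = make_tokens(signal, [[c] for c in range(6)], patch_len=4)

# Region-level: only channels from the selected regions, fused together.
selected = regions["vSMC"] + regions["STG"]
region_level = make_tokens(signal, [selected], patch_len=4)
```

In this toy setup the brain-level and region-level views both yield a single sequence of two tokens, but the region-level one ignores channels outside the selected regions, while the channel-level view yields six separate sequences.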

Paper Structure (57 sections, 12 equations, 11 figures, 19 tables)

Figures (11)

  • Figure 1: Overall illustration of sEEG decoding setup and comparison with SOTA baselines.
  • Figure 2: The overall architecture of the Du-IN Encoder. The Du-IN Encoder is used as the encoder in all Du-IN models (i.e., Du-IN VQ-VAE, Du-IN MAE, Du-IN CLS (classification)); see Appendix \ref{sec:supp-model-details} for more details.
  • Figure 3: Overview of Du-IN VQ-VAE training and Du-IN MAE training. (a) We train the Du-IN Encoder in the Du-IN VQ-VAE to discretize sEEG signals into discrete neural tokens by reconstructing the original sEEG signals. (b) During the training of Du-IN MAE, part of the sEEG patches are masked, and the objective is to predict the masked tokens from the visible patches.
  • Figure 4: The channel contribution analysis. (a) The channel contribution map. (b) The effect of the number of channels (sorted according to channel contribution scores) on decoding performance.
  • Figure 5: Ablation study on different codex sizes, codex dimensions, and receptive fields.
  • ...and 6 more figures
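
The two training stages summarized in the Figure 3 caption can be sketched roughly as follows. This is a simplified illustration under assumptions, not the paper's code: the codex here is a fixed toy table rather than one learned through VQ-VAE reconstruction, the names `nearest_code`, `tokenize`, and `make_masked_example` are hypothetical, and the mask ratio is arbitrary. The sketch only shows how patch embeddings map to discrete token indices and how those indices become prediction targets at masked positions.

```python
import random


def nearest_code(vector, codex):
    """Index of the codex entry closest to `vector` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codex)), key=lambda i: sq_dist(vector, codex[i]))


def tokenize(embeddings, codex):
    """Map each patch embedding to a discrete token index."""
    return [nearest_code(e, codex) for e in embeddings]


def make_masked_example(embeddings, codex, mask_ratio, rng):
    """Build one masked-modeling example.

    Returns (visible, targets): `visible` has None at masked positions, and
    `targets` maps each masked position to the discrete token index that a
    model would be trained to predict there.
    """
    n = len(embeddings)
    n_masked = max(1, round(mask_ratio * n))
    masked = set(rng.sample(range(n), n_masked))
    token_ids = tokenize(embeddings, codex)
    visible = [None if i in masked else e for i, e in enumerate(embeddings)]
    targets = {i: token_ids[i] for i in sorted(masked)}
    return visible, targets


# Toy codex with 3 entries and 4 patch embeddings (made-up values).
codex = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
embeddings = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]]

token_ids = tokenize(embeddings, codex)
visible, targets = make_masked_example(embeddings, codex, 0.5, random.Random(0))
```

In a real pipeline the targets would feed a classification loss over codex indices, computed only at the masked positions.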