Scaling Law in Neural Data: Non-Invasive Speech Decoding with 175 Hours of EEG Data

Motoshige Sato, Kenichi Tomeoka, Ilya Horiguchi, Kai Arulkumaran, Ryota Kanai, Shuntaro Sasai

TL;DR

This work tackles the challenge of open-vocabulary speech decoding from non-invasive EEG by collecting 175 hours of EEG during overt speech from a single participant and training a CLIP-based self-supervised model to align EEG latents with audio latents. The approach enables zero-shot phrase classification and speech reconstruction via a diffusion vocoder, achieving approximately 48.5% top-1 and 76.0% top-10 accuracy on a 512-class task, illustrating a data-length scaling law where performance improves with more data. Key contributions include demonstrating robust EEG-based zero-shot decoding, showing that longer data yields clearer temporal structure in EEG representations, and mitigating EMG contamination through augmentation and artifact-robust training. The results suggest that open-vocabulary, non-invasive EEG speech BCIs are feasible and motivate broader multi-subject data collection and exploration of transfer and covert-speech capabilities for practical deployment.

Abstract

Brain-computer interfaces (BCIs) hold great potential for aiding individuals with speech impairments. Utilizing electroencephalography (EEG) to decode speech is particularly promising due to its non-invasive nature. However, recordings are typically short, and the high variability in EEG data has led researchers to focus on classification tasks with a few dozen classes. To assess its practical applicability for speech neuroprostheses, we investigate the relationship between the size of EEG data and decoding accuracy in the open vocabulary setting. We collected extensive EEG data from a single participant (175 hours) and conducted zero-shot speech segment classification using self-supervised representation learning. The model trained on the entire dataset achieved a top-1 accuracy of 48% and a top-10 accuracy of 76%, while mitigating the effects of myopotential artifacts. Conversely, when the data was limited to the typical amount used in practice ($\sim$10 hours), the top-1 accuracy dropped to 2.5%, revealing a significant scaling effect. Additionally, as the amount of training data increased, the EEG latent representation progressively exhibited clearer temporal structures of spoken phrases. This indicates that the decoder can recognize speech segments in a data-driven manner without explicit measurements of word recognition. This research marks a significant step towards the practical realization of EEG-based speech BCIs.
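To make the zero-shot classification setup concrete, the following is a minimal NumPy sketch of a CLIP-style contrastive loss and top-k retrieval accuracy computed from a pairwise cosine similarity matrix between EEG and audio latents. It is an illustration under stated assumptions, not the authors' implementation: the function names, the symmetric form of the loss, the flattening of each latent into a vector, and the temperature value of 0.07 are all assumptions introduced here.

```python
import numpy as np

def cosine_similarity_matrix(eeg_latents, audio_latents):
    """Pairwise cosine similarity between N EEG and N audio latents.

    Each latent is flattened to a vector first (an assumption of this sketch).
    Entry [i, j] compares EEG segment i with audio segment j.
    """
    e = eeg_latents.reshape(len(eeg_latents), -1)
    a = audio_latents.reshape(len(audio_latents), -1)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    return e @ a.T

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(sim, temperature=0.07):
    """Symmetric cross-entropy over rows and columns of the similarity matrix.

    The matching pair for segment i sits on the diagonal. The temperature
    value is hypothetical, not taken from the paper.
    """
    logits = sim / temperature
    idx = np.arange(len(sim))
    loss_eeg_to_audio = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_audio_to_eeg = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_eeg_to_audio + loss_audio_to_eeg)

def top_k_accuracy(sim, k):
    """Fraction of EEG segments whose true audio segment ranks in the top k."""
    true_scores = np.diag(sim)[:, None]
    # Number of candidates that score strictly higher than the true one.
    ranks = (sim > true_scores).sum(axis=1)
    return float((ranks < k).mean())
```

In this framing, classification is zero-shot because the candidate audio segments at test time need not have appeared in training: the EEG latent is simply matched against whichever candidates are supplied, and top-1 and top-10 accuracy measure how often the true segment is ranked first or within the first ten.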

Paper Structure (27 sections, 4 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Decoding framework. EEG and speech were recorded simultaneously and, for each 5-second segment, converted to latent representations by an EEG encoder and a (fixed) audio encoder, respectively. In step (1), the CLIP loss is applied to the N × N matrix of pairwise cosine similarities between the N segments. In step (2), a diffusion vocoder is trained to reconstruct the speech waveform from the EEG latent representation.
  • Figure 2: A diagram of EEG preprocessing and EEG encoder. (a) The pre-processing procedure. (b) EEG encoder architecture. The number of feature dimensions $F^*$ and the number of time steps $T^*$ in the latent representation differ depending on the audio encoder. The output shape of each layer is shown in the right column.
  • Figure 3: Data scaling. The relationship between the training dataset size (total segment length) and the top-1 accuracy (left), top-10 accuracy (center), and loss (right) on the test dataset. The black dashed line indicates chance level, and the orange dashed line indicates the best linear fit to the data. The green and red arrows mark the sizes of the Broderick (2018) and Brennan (2019) EEG datasets, respectively, as used in Défossez et al. (2023).
  • Figure 4: Voice activity detected from EEG latent representations without explicit training. (a) Process of speech interval detection. The speech waveform (upper left) and EEG (lower left) were each converted into a latent representation (upper right color map) through the encoder. The variance of each feature dimension was computed in a sliding window of 100 ms and then averaged across feature dimensions (lower blue line). Intervals where this value exceeded the threshold (orange line) were detected as speaking periods, and intervals below the threshold were detected as silent periods. The ground-truth speech segments were determined by applying a threshold to the waveform envelope. In this example, the overlap between the ground-truth speech segments and the segments detected from the EEG latent (accuracy) was 0.88. (b) The relationship between the speech detection accuracy and the training dataset size.
  • Figure 5: Representative recorded voice (left) and reconstructed voice (middle). The top panels show the voice waveforms and the bottom panels show mel-spectrograms. Mel-cepstral distortion (MCD) scores (right), where smaller values indicate better performance. Scores were compared against a random model trained on a dataset with shuffled EEG and speech correspondences. The box plot illustrates the distribution of the scores obtained from 8,448 test samples. Decoding performance significantly outperformed chance ($p < 10^{-3}$, $t_{8447} = -73.5$, paired t-test).
  • ...and 2 more figures
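
The speech interval detection described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the centered window, the expression of the window length in latent time steps rather than milliseconds, and the definition of accuracy as the fraction of time steps on which prediction and reference agree are all assumptions made here.

```python
import numpy as np

def detect_speech(latent, window, threshold):
    """Flag speaking periods from a latent representation.

    latent: array of shape (F, T), F feature dimensions by T time steps.
    window: sliding-window length in time steps (hypothetical; the caption
            specifies 100 ms, which depends on the latent's time resolution).
    Returns a boolean array of length T, True where speech is detected.
    """
    n_features, n_steps = latent.shape
    half = window // 2
    activity = np.empty(n_steps)
    for t in range(n_steps):
        lo, hi = max(0, t - half), min(n_steps, t + half + 1)
        # Variance over time within the window for each feature,
        # then averaged across features.
        activity[t] = latent[:, lo:hi].var(axis=1).mean()
    return activity > threshold

def overlap_accuracy(predicted, reference):
    """Fraction of time steps where predicted and reference labels agree."""
    return float((np.asarray(predicted) == np.asarray(reference)).mean())

# Synthetic example: a latent that is flat during silence and varies during speech.
rng = np.random.default_rng(0)
reference = np.zeros(200, dtype=bool)
reference[50:150] = True
latent = np.zeros((16, 200))
latent[:, 50:150] = rng.normal(size=(16, 100))
predicted = detect_speech(latent, window=5, threshold=0.1)
accuracy = overlap_accuracy(predicted, reference)
```

No speech labels are used to compute the activity trace itself. Only the threshold and the reference segments come from outside the latent, which is what makes the detection "without explicit training" in the sense of the caption.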