A Penny for Your Thoughts: Decoding Speech from Inexpensive Brain Signals

Quentin Auster, Kateryna Shapovalenko, Chuang Ma, Demaio Sun

TL;DR

The paper investigates non-invasive EEG-based speech decoding by aligning EEG-derived embeddings with audio representations using a contrastive CLIP loss, building on a state-of-the-art Meta decoder. It introduces three personalized architectural enhancements (subject-specific attention, personalized spatial attention, and a dual-path RNN with attention) and tests their impact on decoding performance. Two of these modifications yield measurable gains, underscoring the value of personalization for brain-to-speech decoding and informing future BCI designs. The study reports these improvements on a naturalistic EEG-audio dataset and outlines concrete directions for scaling, preprocessing refinements, language-model integration, and ethical safeguards to advance practical, accessible brain-computer interfaces.

Abstract

We explore whether neural networks can decode brain activity into speech by mapping EEG recordings to audio representations. Using EEG data recorded as subjects listened to natural speech, we train a model with a contrastive CLIP loss to align EEG-derived embeddings with embeddings from a pre-trained transformer-based speech model. Building on the state-of-the-art EEG decoder from Meta, we introduce three architectural modifications: (i) subject-specific attention layers (+0.15% WER improvement), (ii) personalized spatial attention (+0.45%), and (iii) a dual-path RNN with attention (-1.87%). Two of the three modifications improved performance, highlighting the promise of personalized architectures for brain-to-speech decoding and applications in brain-computer interfaces.
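
To make the alignment objective concrete, the following is a minimal sketch of a CLIP-style contrastive loss in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value, and the choice of a one-directional (EEG-to-audio) loss over cosine similarities are all hypothetical, and the paper's model may differ in each of these respects. Each EEG embedding is scored against every audio embedding in the batch, and the loss is the cross-entropy of picking the audio segment it was actually paired with.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def clip_loss(eeg_emb, audio_emb, temperature=0.1):
    """Mean cross-entropy of matching each EEG embedding to its paired audio embedding.

    eeg_emb and audio_emb are equal-length lists of vectors; row i of one is
    assumed to be the positive pair of row i of the other. The temperature
    default is an illustrative assumption, not a value from the paper.
    """
    eeg = [l2_normalize(v) for v in eeg_emb]
    audio = [l2_normalize(v) for v in audio_emb]
    total = 0.0
    for i, e in enumerate(eeg):
        # Similarity of EEG segment i to every candidate audio segment.
        logits = [sum(a * b for a, b in zip(e, cand)) / temperature for cand in audio]
        # Numerically stable log-sum-exp over the candidates.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        # Negative log-probability assigned to the true pair.
        total += log_z - logits[i]
    return total / len(eeg)


# Toy batch: two orthogonal EEG embeddings with matching or swapped audio.
eeg_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_loss(eeg_batch, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_loss(eeg_batch, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairing yields a loss near zero and the swapped pairing a much larger one, which is the behavior that pushes paired EEG and audio embeddings together during training. The original CLIP formulation averages this term with its audio-to-EEG counterpart; that symmetric variant is omitted here for brevity.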

Paper Structure

This paper contains 27 sections, 1 equation, 3 figures, 1 table.

Figures (3)

  • Figure 1: From Sound to Brain Representation
  • Figure 2: Pre-Processing of EEG and Audio Data
  • Figure 3: Model Architecture