Aligning Brain Signals with Multimodal Speech and Vision Embeddings

Kateryna Shapovalenko, Quentin Auster

TL;DR

The paper tackles decoding language from brain signals by aligning EEG data with layer-aware embeddings from wav2vec2 and CLIP, evaluated across three aggregation strategies. It introduces a three-stage methodology combining ridge regression, layer-depth selection, and an EEG encoder trained with a CLIP-style contrastive loss, applied to word-aligned EEG recorded during natural speech perception. The key finding is that mid-depth layers from both models yield the most robust alignment with EEG, with progressive summation generalizing better than progressive concatenation, though overall generalization remains limited and contrastive decoding struggles to converge. This work highlights the potential of multimodal, layer-aware representations for brain-to-language decoding while underscoring the need for subject-invariant models and larger datasets before robust real-world application is possible.

Abstract

When we hear the word "house", we don't just process sound; we imagine walls, doors, memories. The brain builds meaning through layers, moving from raw acoustics to rich, multimodal associations. Inspired by this, we build on recent work from Meta that aligned EEG signals with averaged wav2vec2 speech embeddings, and ask a deeper question: which layers of pre-trained models best reflect this layered processing in the brain? We compare embeddings from two models: wav2vec2, which encodes sound into language, and CLIP, which maps words to images. Using EEG recorded during natural speech perception, we evaluate how these embeddings align with brain activity using ridge regression and contrastive decoding. We test three strategies: individual layers, progressive concatenation, and progressive summation. The findings suggest that combining multimodal, layer-aware representations may bring us closer to decoding how the brain understands language, not just as sound, but as experience.
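
The three aggregation strategies named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the array shapes, the random stand-in data, the regularization strength `alpha`, and the regression direction (EEG features predicting embedding targets) are all assumptions made for illustration.

```python
import numpy as np

def single_layer(layers, k):
    """Strategy 1: use the embedding from layer k alone."""
    return layers[k]

def progressive_concat(layers, k):
    """Strategy 2: concatenate layers 0..k along the feature axis."""
    return np.concatenate(layers[: k + 1], axis=1)

def progressive_sum(layers, k):
    """Strategy 3: sum layers 0..k element-wise (feature size stays fixed)."""
    return np.sum(layers[: k + 1], axis=0)

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

# Random stand-in data (hypothetical sizes, not the paper's).
rng = np.random.default_rng(0)
n_words, n_layers, emb_dim, eeg_dim = 200, 4, 8, 16
layers = [rng.standard_normal((n_words, emb_dim)) for _ in range(n_layers)]
eeg = rng.standard_normal((n_words, eeg_dim))

# Build one target matrix per strategy, aggregating up to layer index 2.
features = {
    "single": single_layer(layers, 2),
    "concat": progressive_concat(layers, 2),
    "sum": progressive_sum(layers, 2),
}

# Fit one ridge map from EEG features to each target (assumed direction).
weights = {name: ridge_fit(eeg, target) for name, target in features.items()}
```

Note the practical difference between the two progressive strategies: concatenation grows the target dimension with every added layer (here from 8 to 24 columns), while summation keeps it fixed at the per-layer size, so the regression has fewer parameters to fit.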

Paper Structure

This paper contains 15 sections, 9 figures.

Figures (9)

  • Figure 1: EEG Data Preprocessing Pipeline: (a) noisy channel removal, (b) time/frequency feature extraction, (c) normalization and outlier correction.
  • Figure 2: Overview of Methodology.
  • Figure 3: Selecting the Best Number of Components for PCA and ICA.
  • Figure 4: Single-layer regression results (PCA, Subject S04).
  • Figure 5: Progressive concatenation results (PCA, Subject S04).
  • ...and 4 more figures