Recreating Neural Activity During Speech Production with Language and Speech Model Embeddings

Owais Mujtaba Khanday, Pablo Rodríguez San Esteban, Zubair Ahmad Lone, Marc Ouellet, Jose Andres Gonzalez Lopez

TL;DR

The paper reconstructs neural activity recorded during speech production from the embeddings of large self-supervised language and speech models. An ElasticNet regression maps word- and audio-derived embeddings from FastText, GPT-2.0, and Wav2Vec 2.0 XLS-R onto high-gamma sEEG features and is evaluated with leave-one-out cross-validation. Reconstruction is strong across participants, with Pearson correlation coefficient ($PCC$) and $R^2$ values reaching up to $0.99$, though performance varies with electrode coverage and subject, particularly for Wav2Vec 2.0. These findings indicate that linguistic and acoustic representations in pre-trained models align with the neural processes underlying speech, informing future neural speech interfaces and neuroscience studies.

Abstract

Understanding how neural activity encodes speech and language production is a fundamental challenge in neuroscience and artificial intelligence. This study investigates whether embeddings from large-scale, self-supervised language and speech models can effectively reconstruct high-gamma neural activity characteristics, key indicators of cortical processing, recorded during speech production. We leverage pre-trained embeddings from deep learning models trained on linguistic and acoustic data to represent high-level speech features and map them onto these high-gamma signals. We analyze the extent to which these embeddings preserve the spatio-temporal dynamics of brain activity. Reconstructed neural signals are evaluated against high-gamma ground-truth activity using correlation metrics and signal reconstruction quality assessments. The results indicate that high-gamma activity can be effectively reconstructed using large language and speech model embeddings in all study participants, generating Pearson's correlation coefficients ranging from 0.79 to 0.99.
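The evaluation loop summarized above (an elastic-net regression from embeddings to a high-gamma feature, scored with leave-one-out cross-validation, Pearson correlation, and $R^2$) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the data are synthetic, each trial is assumed to have one embedding vector and one scalar high-gamma value for a single electrode, and the helper names (`fit_elastic_net`, `leave_one_out_predictions`, `pearson`, `r_squared`) and the hyperparameters `alpha` and `l1_ratio` are hypothetical choices made for this example.

```python
import math
import random
from statistics import fmean


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha=0.01, l1_ratio=0.5, n_iter=100):
    """Coordinate-descent elastic net; returns (weights, intercept).

    Minimizes (1/2n)*||y - Xw - b||^2 + alpha*l1_ratio*||w||_1
              + 0.5*alpha*(1 - l1_ratio)*||w||^2.
    """
    n, p = len(X), len(X[0])
    x_mean = [fmean(row[j] for row in X) for j in range(p)]
    y_mean = fmean(y)
    # Center the data so the intercept can be recovered after fitting.
    cols = [[row[j] - x_mean[j] for row in X] for j in range(p)]
    residual = [yi - y_mean for yi in y]  # weights start at zero
    w = [0.0] * p
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # constant feature carries no information
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - c * delta for r, c in zip(residual, col)]
                w[j] = new_w
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, intercept


def leave_one_out_predictions(X, y, **kwargs):
    """Predict each trial from a model fitted on all the other trials."""
    preds = []
    for i in range(len(X)):
        w, b = fit_elastic_net(X[:i] + X[i + 1:], y[:i] + y[i + 1:], **kwargs)
        preds.append(b + sum(wj * xj for wj, xj in zip(w, X[i])))
    return preds


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - q) ** 2 for t, q in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data (assumption): 30 trials, 5-dimensional "embeddings",
# and a noisy linear "high-gamma" target for one electrode.
rng = random.Random(0)
true_w = [1.0, -0.5, 0.0, 0.8, 0.0]
X = [[rng.gauss(0, 1) for _ in true_w] for _ in range(30)]
y = [sum(wj * xj for wj, xj in zip(true_w, row)) + rng.gauss(0, 0.1) for row in X]

preds = leave_one_out_predictions(X, y)
pcc = pearson(y, preds)
r2 = r_squared(y, preds)
print(f"PCC = {pcc:.3f}, R^2 = {r2:.3f}")
```

Because every prediction comes from a model that never saw the held-out trial, the resulting scores estimate generalization rather than fit to the training data. In practice a library implementation such as scikit-learn's `ElasticNet` would replace the hand-written solver.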

Paper Structure

This paper contains 14 sections, 3 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Methodology overview.
  • Figure 2: $R^2$ scores for the FastText embeddings.
  • Figure 3: $R^2$ scores for the GPT-2.0 embeddings.
  • Figure 4: $R^2$ scores for the Wav2Vec 2.0 embeddings.
  • Figure 5: Average distribution of trials based on $R^2$ scores.