Decoding Phone Pairs from MEG Signals Across Speech Modalities

Xabier de Zuazo, Eva Navas, Ibon Saratxaga, Mathieu Bourguignon, Nicola Molinaro

TL;DR

The paper decodes phonetic information from MEG recorded during overt speech production and perception, comparing regularized linear models with neural network classifiers. Production yields substantially higher decoding accuracy than perception, and Elastic Net outperforms the neural networks in this limited-data regime. Low-frequency bands, especially Delta and Theta, carry the most phonetic information during production, while higher frequencies contribute minimally, which points to a role for slow oscillations in speech motor processes. The work contributes a reproducible pipeline, a pairwise classification scheme covering 15 phonetic pairs, and insights for designing MEG-based speech BCIs, while noting possible residual artifacts and the need for larger datasets to generalize beyond the present sample.

Abstract

Understanding the neural mechanisms underlying speech production is essential both for advancing cognitive neuroscience theory and for developing practical communication technologies. In this study, we used magnetoencephalography (MEG) signals to decode phones from brain activity during speech production and perception (passive listening and voice playback) tasks. Using a dataset comprising 17 participants, we performed pairwise phone classification, extending our analysis to 15 phonetic pairs. Multiple machine learning approaches, including regularized linear models and neural network architectures, were compared to determine their effectiveness in decoding phonetic information. Our results demonstrate significantly higher decoding accuracy during speech production (76.6%) than during the passive listening and playback modalities (~51%), emphasizing the richer neural information available during overt speech. Among the models, the Elastic Net classifier consistently outperformed more complex neural networks, highlighting the effectiveness of traditional regularization techniques when applied to limited and high-dimensional MEG datasets. In addition, analysis of specific brain frequency bands revealed that low-frequency oscillations, particularly Delta (0.2-3 Hz) and Theta (4-7 Hz), contributed most substantially to decoding accuracy, suggesting that these bands encode critical speech production-related neural processes. Despite the use of advanced denoising methods, it remains unclear whether decoding reflects neural activity alone or whether residual muscular or movement artifacts also contributed, indicating the need for further methodological refinement. Overall, our findings underline the importance of examining overt speech production paradigms, which, despite their complexity, offer opportunities to improve brain-computer interfaces for individuals with severe speech impairments.
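
To make the pairwise Elastic Net classification described in the abstract more concrete, the following is a minimal sketch of a binary logistic classifier with a combined L1 and L2 penalty, fitted by proximal gradient descent. It is not the authors' pipeline: the synthetic Gaussian features stand in for MEG data, and every function name, hyperparameter (`alpha`, `l1_ratio`, `lr`, `epochs`), and data dimension is an assumption chosen for illustration.

```python
import math
import random


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_elastic_net_logreg(X, y, alpha=0.05, l1_ratio=0.5, lr=0.1, epochs=300):
    """Fit logistic regression with an Elastic Net penalty (hypothetical settings).

    Objective: mean log-loss
               + alpha * (l1_ratio * |w|_1 + 0.5 * (1 - l1_ratio) * |w|_2^2)
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        for j in range(d):
            # Gradient step on the smooth part (log-loss plus L2 term) ...
            step = w[j] - lr * (grad_w[j] + alpha * (1.0 - l1_ratio) * w[j])
            # ... followed by the L1 proximal step, which can zero out weights.
            w[j] = soft_threshold(step, lr * alpha * l1_ratio)
        b -= lr * grad_b  # the intercept is left unpenalized
    return w, b


def predict(X, w, b):
    """Return 0/1 labels, one per trial."""
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]


def make_synthetic_pair(n_trials=200, n_features=20, n_informative=3, seed=0):
    """Synthetic stand-in for one phone pair: only a few features carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_trials):
        label = i % 2  # alternate the two classes so they stay balanced
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in range(n_informative):
            row[j] += 1.5 if label else -1.5
        X.append(row)
        y.append(label)
    return X, y


def accuracy(predictions, labels):
    """Fraction of trials classified correctly."""
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)
```

A typical use would be to call `make_synthetic_pair()`, fit on the first portion of the trials with `fit_elastic_net_logreg`, and score the held-out remainder with `accuracy(predict(...), ...)`. The L1 term is what encourages sparse weights, which is one reason such penalties are often preferred when features far outnumber trials.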

Paper Structure

This paper contains 18 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Average phone count across all subjects, including production, listening, and playback tasks. Black bars show the standard deviation. The phones selected for this work are marked in green and comprise five vowels and 10 consonants.
  • Figure 2: Decoding accuracies on the listening task, averaged across all subjects.
  • Figure 3: Decoding accuracies on the production task, averaged across all subjects.
  • Figure 4: Decoding accuracies by frequency band for the speech perception (Listening and Playback) and speech production modalities. The y-axis ranges from 40% to 80% for better visibility. Bold annotations indicate the best accuracy for each modality. The dashed black line represents the chance accuracy level (50%), and the dashed blue line indicates the baseline (ceiling) accuracy for the production modality (74.02%), obtained without frequency filtering.
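
Because each pairwise comparison has a chance level of 50%, one simple way to judge whether an observed accuracy is above chance is a one-sided binomial test. The sketch below illustrates that idea only; the statistical procedure actually used in the paper is not stated in this summary, and the trial count of 200 in the usage note is a hypothetical value, not taken from the dataset.

```python
from math import comb


def binomial_p_value(n_correct, n_trials, chance=0.5):
    """One-sided p-value P(X >= n_correct) for X ~ Binomial(n_trials, chance).

    A small value suggests the accuracy is unlikely under pure guessing.
    """
    return sum(
        comb(n_trials, k) * chance**k * (1.0 - chance) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


def correct_count(accuracy, n_trials):
    """Convert an accuracy in [0, 1] to a whole number of correct trials."""
    return round(accuracy * n_trials)
```

With a hypothetical 200 test trials, `binomial_p_value(correct_count(0.51, 200), 200)` is far from significant, whereas `binomial_p_value(correct_count(0.766, 200), 200)` is vanishingly small. The same accuracy would lead to different conclusions with fewer trials, so the trial count matters as much as the percentage.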