Bridging Auditory Perception and Language Comprehension through MEG-Driven Encoding Models

Matteo Ciferri; Matteo Ferrante; Nicola Toschi

TL;DR

This work uses MEG to study how the brain processes auditory and linguistic information, evaluating encoding models that map audio and text representations to neural activity. It compares two audio encoders (STFT-based time-frequency decompositions and wav2vec2 embeddings) and two text encoders (CLIP and GPT-2 embeddings), each fitted with ridge regression to MEG data recorded during naturalistic stories. The key finding is that text-derived representations yield higher encoding accuracy and engage frontal regions such as Broca's area, especially in the 8-30 Hz band, while audio representations predominantly activate lateral temporal regions. These results point to distinct neural pathways for auditory versus linguistic processing and show quantitative improvements in modeling neural responses to naturalistic language stimuli, with potential implications for brain-computer interfaces and language-related clinical applications.

Abstract

Understanding the neural mechanisms behind auditory and linguistic processing is key to advancing cognitive neuroscience. In this study, we use magnetoencephalography (MEG) data to analyze brain responses to spoken language stimuli. We develop two distinct encoding models: an audio-to-MEG encoder, which uses time-frequency decompositions (TFD) and wav2vec2 latent space representations, and a text-to-MEG encoder, which leverages CLIP and GPT-2 embeddings. Both models successfully predict neural activity, demonstrating significant correlations between estimated and observed MEG signals. However, the text-to-MEG model outperforms the audio-based model, achieving a higher Pearson correlation (PC) score. Spatially, we find that auditory-based embeddings (TFD and wav2vec2) predominantly activate lateral temporal regions, which are responsible for primary auditory processing and the integration of auditory signals. In contrast, textual embeddings (CLIP and GPT-2) primarily engage the frontal cortex, particularly Broca's area, which is associated with higher-order language processing, including semantic integration and language production; this engagement is strongest in the 8-30 Hz frequency range. The strong involvement of these regions suggests that auditory stimuli are processed through more direct sensory pathways, while linguistic information is encoded via networks that integrate meaning and cognitive control. Our results reveal distinct neural pathways for auditory and linguistic information processing, with higher encoding accuracy for text representations in the frontal regions. These insights refine our understanding of the brain's functional architecture in processing auditory and textual information, offering quantitative advancements in the modelling of neural responses to complex language stimuli.
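The encoding idea summarised in the abstract (stimulus embeddings mapped to MEG features with ridge regression, then scored with a Pearson correlation per target) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the array shapes, the penalty `alpha`, and the helper names `fit_ridge` and `pearson_per_column` are hypothetical, and the closed-form solver stands in for whatever implementation the authors actually used.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge solution W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)

def pearson_per_column(Y_true, Y_pred):
    """Pearson correlation between matching columns of two 2-D arrays."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom

# Synthetic stand-ins (assumption): rows are time samples, X columns are
# stimulus-embedding dimensions, Y columns are MEG targets (for example,
# one sensor in one frequency band).
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_targets = 400, 100, 16, 8
X = rng.standard_normal((n_train + n_test, n_features))
W_true = rng.standard_normal((n_features, n_targets))
Y = X @ W_true + 0.5 * rng.standard_normal((n_train + n_test, n_targets))

X_train, X_test = X[:n_train], X[n_train:]
Y_train, Y_test = Y[:n_train], Y[n_train:]

# Centre with training statistics only, so no test information leaks in.
x_mean, y_mean = X_train.mean(axis=0), Y_train.mean(axis=0)
W = fit_ridge(X_train - x_mean, Y_train - y_mean, alpha=10.0)

# Predict held-out targets and score each one separately.
Y_pred = (X_test - x_mean) @ W + y_mean
pc = pearson_per_column(Y_test, Y_pred)
print(pc.round(3))
```

Scoring each target column separately is what makes sensor-wise and band-wise topography maps possible: each correlation value can be placed at its sensor location.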

Paper Structure (15 sections, 4 figures, 2 tables)

Introduction
Related Work
Material and Methods
Data
Audio Encoding Models
Text Encoding Models
Reconstruction from Audio and Text Embeddings
Evaluation and Statistical validation
Results
Evaluation of Correlation Metrics
Statistical Validation
Discussion
Brain Representation
Conclusions and Future Directions
Acknowledgements

Figures (4)

Figure 1: Schematic representation of the encoding pipelines. The top part of the figure is structured as follows. Left: initial input stimulus (audio). Centre: two different encoders individually process the auditory stimulus to generate embeddings. The first encoder uses time-frequency decompositions to extract features, while the second encoder uses the wav2vec2 model to convert the audio data into latent representations. Right: prediction of MEG time-frequency decompositions (e.g., spectrograms) using ridge regression with the embeddings as independent variables (predictors). For text stimuli (bottom part of the figure), the same pipeline is adapted with CLIP and GPT-2 models as encoders, which generate embeddings that capture the semantic content and contextual information of the text. This setup allows us to compare and analyze how different types of stimuli are represented and processed in the brain.
Figure 2: Pearson correlation topography maps (subject-wise average), visualizing the performance of all encoding strategies (TFD, wav2vec2, CLIP, GPT-2) for every sensor and frequency band, as well as the full-spectrum case ("complete"). For audio encoders, high correlation values occur in lateral brain areas, while textual models also exhibit significant performance in frontal regions. Performance varies notably across frequency bands, particularly in lower frequencies, which are typically associated with states of rest or sleep rather than concentration and cognitive processing.
Figure 3: Topographic maps of PC Z-scores for the entire frequency range (complete, 0-30 Hz) across four different models (TFD, wav2vec2, CLIP, GPT-2). Each map illustrates the spatial distribution of Z-scores on the scalp, where Z-scores quantify the degree of association between the real and predicted MEG spectrograms. Higher Z-scores (shown in green) represent a stronger positive association and suggest that the model's predictions differ significantly from what would be expected by chance. Negative Z-scores (shown in red) also indicate significant non-random associations, but in the opposite direction, highlighting regions where the model's predictions systematically deviate from the observed data. These maps provide a visual representation of the non-random patterns of brain activity predicted by each model, emphasizing the regions where the models' predictions are most robust and significant.
Figure 4: Violin plots depicting the distributions of Z-transformed PC values obtained by regressing real against predicted TFDs for 4 different encoding models (TFD, wav2vec2, CLIP, GPT-2).
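The captions above describe Z-scores that compare an observed correlation with what would be expected by chance, but they do not say how those scores were computed. One common way to obtain such a score, shown here purely as an assumption and not as the authors' procedure, is to build a null distribution by shuffling the predicted signal and to express the observed Pearson correlation in units of that null's standard deviation. The function names `pearson` and `permutation_z`, the number of permutations, and the synthetic signals are all hypothetical.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))

def permutation_z(y_true, y_pred, n_perm=1000, seed=0):
    """Z-score of the observed correlation against a shuffled-prediction null."""
    rng = np.random.default_rng(seed)
    observed = pearson(y_true, y_pred)
    # Shuffling breaks the alignment between prediction and observation.
    null = np.array([pearson(y_true, rng.permutation(y_pred))
                     for _ in range(n_perm)])
    return (observed - null.mean()) / null.std(ddof=1)

# Synthetic example (assumption): one prediction that tracks the signal
# and one that is unrelated to it.
rng = np.random.default_rng(1)
y_true = rng.standard_normal(200)
y_good = y_true + rng.standard_normal(200)   # informative prediction
y_rand = rng.standard_normal(200)            # uninformative prediction

z_good = permutation_z(y_true, y_good)
z_rand = permutation_z(y_true, y_rand)
print(round(z_good, 2), round(z_rand, 2))
```

Simple sample-wise shuffling ignores temporal autocorrelation, which real MEG spectrograms have; block-wise or circular-shift permutations are often preferred in that setting, so this sketch should be read as the simplest possible variant.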