A predictive learning model can simulate temporal dynamics and context effects found in neural representations of continuous speech

Oli Danyi Liu, Hao Tang, Naomi Feldman, Sharon Goldwater

TL;DR

The paper investigates whether a self-supervised predictive model can emulate neural representations of continuous speech. By training a CPC-based SSL model on large unlabeled corpora and decoding phoneme information from its representations, the authors show that the model exhibits simultaneous encoding of multiple phones with evolving representations and partial cross-context generalization, paralleling neural data. However, cross-context generalization in the model correlates with acoustic similarity, suggesting limited evidence for context-invariant phoneme representations beyond what acoustics account for. This work demonstrates that predictive learning can reproduce several brain-like properties of speech processing and highlights directions for exploring other architectures and deeper neural validation.

Abstract

Speech perception involves storing and integrating sequentially presented items. Recent work in cognitive neuroscience has identified temporal and contextual characteristics in humans' neural encoding of speech that may facilitate this temporal processing. In this study, we simulated similar analyses with representations extracted from a computational model that was trained on unlabelled speech with the learning objective of predicting upcoming acoustics. Our simulations revealed temporal dynamics similar to those in brain signals, implying that these properties can arise without linguistic knowledge. Another property shared between brains and the model is that the encoding patterns of phonemes support some degree of cross-context generalization. However, we found evidence that the effectiveness of these generalizations depends on the specific contexts, which suggests that this analysis alone is insufficient to support the presence of context-invariant encoding.

Paper Structure (15 sections, 5 figures)

Figures (5)

  • Figure 1: Accuracy of decoding for phoneme categories with CPC representations and logmel features. The shaded area represents the average duration of a phone.
  • Figure 2: Temporal generalization (TG) results superimposed for 4 phone positions, the first to the fourth phone in each word (p1-p4), obtained with (a) MEG signals with contours at t-value = 4 (reproduced from source data provided by Gwilliams et al.); (b) CPC representations, with accuracy contours at 0.4 (solid) and 0.2 (dotted); and (c) Log mel features, with accuracy contours at 0.2.
  • Figure 3: Generalizing from word-initial position to other word positions: (a) results from brain recordings (taken from Gwilliams et al. 2021) and (b) accuracy of our decoders. Decoders are trained on word-initial vowels (p1) and tested on vowels in p1-p4. The lefthand plot shows decoding accuracies for model representations, and the righthand plot for acoustic features. The faded lines are the baseline accuracy for each position obtained by picking the most common vowel category in the training set. (Note that /a/ is the most common category in all positions, but to differing degrees, which leads to the different baselines.)
  • Figure 4: Additional generalization tests across (a) different phone positions and (b) different phonetic contexts. The faded lines represent the baseline accuracy obtained by picking the most common vowel at the training position/context.
  • Figure 5: Cross-position and cross-context generalization effects in log mel features and CPC representations correlate positively. Each circle represents the generalization effects on a test position/context.
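The temporal generalization analysis of Figure 2 and the majority-class baseline of Figures 3 and 4 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the nearest-centroid classifier is a stand-in for whatever decoder the paper uses, and the array layout (samples, time steps, features) and all function names are assumptions made for the example.

```python
import numpy as np


def majority_baseline(train_labels, test_labels):
    """Accuracy from always predicting the most common training label."""
    values, counts = np.unique(train_labels, return_counts=True)
    majority = values[np.argmax(counts)]
    return float(np.mean(test_labels == majority))


def fit_centroids(X, y):
    """Nearest-centroid 'decoder' (assumed stand-in): one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(classes, centroids, X):
    """Assign each row of X to the class with the closest centroid."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dists, axis=1)]


def temporal_generalization(X_train, y_train, X_test, y_test):
    """Train a decoder at each time step and test it at every time step.

    X_train, X_test: arrays of shape (n_samples, n_times, n_features),
    e.g. representations aligned to phone onset (assumed layout).
    Returns an (n_times, n_times) matrix of accuracies indexed as
    [train_time, test_time].
    """
    n_times = X_train.shape[1]
    tg = np.zeros((n_times, n_times))
    for t_train in range(n_times):
        classes, centroids = fit_centroids(X_train[:, t_train], y_train)
        for t_test in range(n_times):
            pred = predict(classes, centroids, X_test[:, t_test])
            tg[t_train, t_test] = np.mean(pred == y_test)
    return tg
```

Cross-position generalization (Figures 3 and 4) follows the same pattern: fit the decoder on items from one word position or context, evaluate it on items from another, and compare the result against `majority_baseline` computed from the training labels.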