Perception of Phonological Assimilation by Neural Speech Recognition Models

Charlotte Pouw, Marianne de Heer Kloots, Afra Alishahi, Willem Zuidema

Abstract

Human listeners effortlessly compensate for phonological changes during speech perception, often unconsciously inferring the intended sounds. For example, listeners infer the underlying /n/ when hearing an utterance such as "clea[m] pan", where [m] arises from place assimilation to the following labial [p]. This article explores how the neural speech recognition model Wav2Vec2 perceives assimilated sounds, and identifies the linguistic knowledge that is implemented by the model to compensate for assimilation during Automatic Speech Recognition (ASR). Using psycholinguistic stimuli, we systematically analyze how various linguistic context cues influence compensation patterns in the model's output. Complementing these behavioral experiments, our probing experiments indicate that the model shifts its interpretation of assimilated sounds from their acoustic form to their underlying form in its final layers. Finally, our causal intervention experiments suggest that the model relies on minimal phonological context cues to accomplish this shift. These findings represent a step towards better understanding the similarities and differences in phonological processing between neural ASR models and humans.

Paper Structure (28 sections, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Compensation rate (i.e., the proportion of stimuli for which the underlying consonant is transcribed) of Wav2Vec2 and humans in viable and unviable assimilation contexts (e.g., clea[m] pan versus clea[m] spoon, respectively). In the control condition, the target consonant is not assimilated (e.g., clea[n] fork). N = 48 for each condition. Error bars denote the 95% Wilson confidence interval.
  • Figure 2: Compensation rate (i.e., the proportion of stimuli for which the underlying consonant is transcribed) of Wav2Vec2 in viable and unviable assimilation contexts (e.g., ru[m] picks versus ru[m] does, respectively), with different types of preceding sentential context (neutral context, biasing context, and random context). N = 38 for each condition. Error bars denote the 95% Wilson confidence interval.
  • Figure 3: Comparison between the compensation behavior of Wav2Vec2 and that of the human participants in Gaskell and Marslen-Wilson (2001). The left side of the figure shows the effect of semantic context (neutral vs. biasing) on compensation behavior; the right side of the figure shows the effect of phonological context (viable vs. unviable).
  • Figure 4: Accuracy of binary probing classifiers, trained and evaluated on frame-level representations from individual Wav2Vec2 layers (extracted using the TIMIT corpus). Each classifier has to distinguish between two candidate phoneme labels (indicated in the legend).
  • Figure 5: Layerwise preference of binary linear probing classifiers for the underlying consonant /n/ or the surface consonant (top: /m/, bottom: /ŋ/) given Wav2Vec2 representations at the position of the assimilated consonant. The three line colors indicate whether the model compensated for the assimilation in its final transcription. Error bars denote the standard error of the mean.
  • ...and 7 more figures
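
The captions for Figures 1 and 2 report compensation rates with 95% Wilson confidence intervals. As a rough illustration of how such an interval follows from a count of compensated stimuli, here is a minimal sketch using the standard Wilson score formula. The function name `wilson_interval` and the example counts are hypothetical and are not taken from the paper; only the condition size N = 48 comes from the Figure 1 caption.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    successes: number of stimuli transcribed with the underlying consonant
    n:         total number of stimuli in the condition
    z:         normal quantile (1.96 corresponds to roughly 95% coverage)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")

    p = successes / n                      # observed compensation rate
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom  # interval midpoint, pulled toward 0.5
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom

    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Hypothetical count: 30 of 48 stimuli compensated (illustrative only).
    low, high = wilson_interval(30, 48)
    print(f"rate = {30 / 48:.3f}, 95% Wilson CI = [{low:.3f}, {high:.3f}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays within [0, 1] and remains informative when the observed rate is at or near 0 or 1, which matters for conditions with as few as 38 or 48 stimuli.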