Human-like Linguistic Biases in Neural Speech Models: Phonetic Categorization and Phonotactic Constraints in Wav2Vec2.0

Marianne de Heer Kloots, Willem Zuidema

TL;DR

This paper examines whether neural speech models implicitly encode human-like phonotactic constraints. The authors apply a controlled phonetic categorization paradigm, presenting /l/–/r/ continua to several Wav2Vec2 variants (ASR-finetuned, pretrained base/large, untrained, and acoustic-scene pretrained), and analyze internal representations with embedding similarities, probing classifiers, and CTC-lens. The models show a human-like bias toward the phonotactically admissible category. The effect emerges in early layers of the Transformer module, is strengthened by ASR finetuning, and is also present after self-supervised pretraining alone. The results indicate that phonological knowledge can arise from self-supervised speech learning, and that carefully designed stimuli combined with interpretable readouts can localize linguistic biases across model variants, which may inform interpretability research and ASR design.

Abstract

What do deep neural speech models know about phonology? Existing work has examined the encoding of individual linguistic units such as phonemes in these models. Here we investigate interactions between units. Inspired by classic experiments on human speech perception, we study how Wav2Vec2 resolves phonotactic constraints. We synthesize sounds on an acoustic continuum between /l/ and /r/ and embed them in controlled contexts where only /l/, only /r/, or neither occur in English. Like humans, Wav2Vec2 models show a bias towards the phonotactically admissible category in processing such ambiguous sounds. Using simple measures to analyze model internals on the level of individual stimuli, we find that this bias emerges in early layers of the model's Transformer module. This effect is amplified by ASR finetuning but also present in fully self-supervised models. Our approach demonstrates how controlled stimulus designs can help localize specific linguistic knowledge in neural speech models.

Paper Structure

This paper contains 14 sections, 1 equation, and 4 figures.

Figures (4)

  • Figure 1: Responses of the ASR-finetuned Wav2Vec2-base model to the ambiguous sounds at each step of the continua between /l/ (0) and /r/ (10). Black circles mark the crossing points at which our response measures start indicating a preference for 'R' above 'L'.
  • Figure 2: The model's output layer (T12) shows sensitivity to phonotactic context, based on preference for 'R' using embedding similarities (Eq. 1) or max. probability forced choices.
  • Figure 3: Phonotactic sensitivity across layers in the Wav2Vec2-base models, measured as difference in similarities to the 'R' endpoint between the tXih vs. sXih continua. A bias towards the phonotactically admissible consonant can be observed in the ASR-finetuned (ft; top row) and the speech-pretrained (pret-sp; second row) models, but not in the acoustic scenes (pret-acs) and untrained (unt) models.
  • Figure 4: Aggregated layerwise results for all analysis methods and two model sizes. Each point shows the highest difference in preference for 'R' between the tXih and sXih continua, across all intermediate continuum steps.
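
The measures named in these captions (preference for 'R' from embedding similarities, crossing points, and the largest between-context difference over intermediate steps) can be illustrated with a minimal sketch. Eq. 1 is not reproduced on this page, so the similarity measure here is an assumption: cosine similarity to the 'R' endpoint minus cosine similarity to the 'L' endpoint. The function names, the `toy_continuum` helper, and the synthetic vectors standing in for Wav2Vec2 layer activations are all hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def r_preference(step_embs, l_end, r_end):
    """Assumed measure: similarity to the 'R' endpoint minus similarity
    to the 'L' endpoint, for each continuum step (positive favours 'R')."""
    return np.array([cosine(e, r_end) - cosine(e, l_end) for e in step_embs])

def crossing_point(pref):
    """Index of the first continuum step that favours 'R', or None."""
    above = np.flatnonzero(pref > 0)
    return int(above[0]) if above.size else None

def phonotactic_sensitivity(pref_a, pref_b):
    """Largest difference in 'R' preference between two contexts,
    taken over intermediate steps only (both endpoints excluded)."""
    return float(np.max(pref_a[1:-1] - pref_b[1:-1]))

def toy_continuum(l_end, r_end, bias, n_steps=11):
    """Hypothetical stand-in for layer embeddings of an /l/-/r/ continuum.
    Steps interpolate between the endpoints; `bias` pushes ambiguous steps
    toward 'R' (positive) or 'L' (negative) and leaves the endpoints fixed."""
    alpha = np.linspace(0.0, 1.0, n_steps)
    shifted = np.clip(alpha + bias * 4 * alpha * (1 - alpha), 0.0, 1.0)
    return np.outer(1 - shifted, l_end) + np.outer(shifted, r_end)

# Synthetic unit-norm endpoint embeddings (not real model activations).
rng = np.random.default_rng(0)
l_end = rng.normal(size=16)
l_end /= np.linalg.norm(l_end)
r_end = rng.normal(size=16)
r_end /= np.linalg.norm(r_end)

# tXih: only /r/ is admissible after /t/; sXih: only /l/ is admissible after /s/.
pref_t = r_preference(toy_continuum(l_end, r_end, bias=0.3), l_end, r_end)
pref_s = r_preference(toy_continuum(l_end, r_end, bias=-0.3), l_end, r_end)

sensitivity = phonotactic_sensitivity(pref_t, pref_s)
print("crossing points (tXih, sXih):", crossing_point(pref_t), crossing_point(pref_s))
print("phonotactic sensitivity:", round(sensitivity, 3))
```

With real data, `step_embs` would be replaced by one layer's embeddings of the ambiguous segment at each continuum step, and the computation repeated per layer to obtain a layerwise profile like the one the captions describe. A positive sensitivity means the tXih context shifts responses toward 'R' relative to sXih, which is the direction of the phonotactically admissible category.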