A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models

Antón de la Fuente, Dan Jurafsky

TL;DR

The paper investigates how self-supervised speech models encode suprasegmental information (Mandarin lexical tone, English lexical stress, and English phrasal pitch accents) via layer-wise probing of wav2vec 2.0 (English and Mandarin), HuBERT, and WavLM. Probes are trained on the outputs of each layer (0–12) to predict syllable-level labels from the Switchboard and GTMC corpora. The results indicate that abstract suprasegmental representations peak in the middle layers and are largely independent of raw F0 cues. Language specificity emerges in the context network: models represent features of their pretraining language better, and ASR fine-tuning enhances late-layer representations for lexical features like tone and stress more than for phrasal accents. Across models and pretraining tasks, the results suggest generalizable, abstract, context-driven representations of suprasegmentals with limited dependence on orthography or surface acoustics, which informs transfer and cross-linguistic prosody modeling.

Abstract

This study asks how self-supervised speech models represent suprasegmental categories like Mandarin lexical tone, English lexical stress, and English phrasal accents. Through a series of probing tasks, we make layer-wise comparisons of English and Mandarin 12-layer monolingual models. Our findings suggest that 1) English and Mandarin wav2vec 2.0 models learn contextual representations of abstract suprasegmental categories, which are strongest in the middle third of the network; 2) models are better at representing features that exist in the language of their training data, and this difference is driven by enriched context in transformer blocks, not local acoustic representation; 3) fine-tuned wav2vec 2.0 improves performance in later layers compared to pre-trained models, mainly for lexically contrastive features like tone and stress; and 4) HuBERT and WavLM learn similar representations to wav2vec 2.0, differing mainly in later-layer performance. Our results extend previous understanding of how models represent suprasegmentals and offer new insights into the language-specificity and contextual nature of these representations.
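
The layer-wise probing procedure described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' code: the per-layer features are synthetic stand-ins for syllable-level hidden states, a nearest-centroid classifier is swapped in for whatever probe the paper trains, and every name and constant (`synthetic_layer_features`, `probe_all_layers`, `NUM_CLASSES`, `DIM`) is hypothetical. The synthetic class signal is deliberately made strongest in the middle layers so that the loop has something to find; the numbers it prints say nothing about real models.

```python
import math
import random

NUM_LAYERS = 13   # layer outputs 0-12, as in the summary above
NUM_CLASSES = 4   # hypothetical label-set size (assumption, not from the paper)
DIM = 8           # toy feature size; real hidden states are far larger


def synthetic_layer_features(labels, layer, rng):
    """Stand-in for per-syllable hidden states at one layer (NOT model output).

    The class signal is built to peak at layer 6 purely for illustration.
    """
    strength = 4.0 * math.exp(-((layer - 6) / 2.0) ** 2)
    feats = []
    for y in labels:
        vec = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        vec[y] += strength  # shift one dimension per class
        feats.append(vec)
    return feats


def fit_centroids(feats, labels):
    """Compute the mean feature vector of each class."""
    sums = [[0.0] * DIM for _ in range(NUM_CLASSES)]
    counts = [0] * NUM_CLASSES
    for vec, y in zip(feats, labels):
        counts[y] += 1
        for d, v in enumerate(vec):
            sums[y][d] += v
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]


def predict(centroids, vec):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
    return dists.index(min(dists))


def probe_all_layers(n_train=400, n_test=400, seed=0):
    """Fit one probe per layer and return held-out accuracy for each layer."""
    rng = random.Random(seed)
    train_y = [rng.randrange(NUM_CLASSES) for _ in range(n_train)]
    test_y = [rng.randrange(NUM_CLASSES) for _ in range(n_test)]
    accuracies = []
    for layer in range(NUM_LAYERS):
        train_x = synthetic_layer_features(train_y, layer, rng)
        test_x = synthetic_layer_features(test_y, layer, rng)
        centroids = fit_centroids(train_x, train_y)
        correct = sum(predict(centroids, x) == y for x, y in zip(test_x, test_y))
        accuracies.append(correct / n_test)
    return accuracies


if __name__ == "__main__":
    accs = probe_all_layers()
    best = max(range(NUM_LAYERS), key=accs.__getitem__)
    for layer, acc in enumerate(accs):
        print(f"layer {layer:2d}: accuracy {acc:.3f}")
    print("best layer:", best)
```

In a real replication, `synthetic_layer_features` would be replaced by hidden states extracted from each layer of the speech model and pooled over each labeled syllable, and the per-layer accuracies would be compared against the random and Mel-filterbank baselines mentioned in the figure captions.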

Paper Structure

This paper contains 11 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Probe performance of the English monolingual wav2vec2-base (black) and the Mandarin monolingual mandarin-wav2vec2 (orange) on each task. A red point marks the best layer for each model. Dashed lines show random baselines (red) or Mel-filterbank baselines (blue).
  • Figure 2: Performance of fine-tuned (dashed) and pre-trained (solid) monolingual English (black) and Mandarin (orange) wav2vec 2.0 models on all tasks. A red point marks the best layer for each model. Dashed lines show random baselines (red) or Mel-filterbank baselines (blue).
  • Figure 3: Performance of English wav2vec 2.0 (black), HuBERT (tawny), and WavLM (blue) on all tasks. A red point marks the best layer for each model. Dashed lines show random baselines (red) or Mel-filterbank baselines (blue).