Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations

Haitong Sun, Stephen McIntosh, Kwanghee Choi, Eunjung Yeo, Daisuke Saito, Nobuaki Minematsu

Abstract

Speech representations from self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been directly measured. The ABX discrimination task has been used to measure phonemic contrast in S3M representations via minimal pairs. We introduce prosodic ABX, an extension of this framework that evaluates prosodic contrast with only a handful of examples and no explicit labels. We also build and release a dataset of English and Japanese minimal pairs and use it, along with a Mandarin dataset, to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, we show that model and layer rankings are often preserved across several experimental conditions, making the method practical for low-resource settings.
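The ABX task mentioned above tests whether a representation keeps same-category items closer together than cross-category items: given a triplet (A, B, X) where X belongs to the same category as A (e.g., the same stress pattern) and B belongs to the contrasting category, the trial counts as an error when X's representation is at least as close to B as to A. The paper does not specify its distance function here; the sketch below is a minimal illustration assuming mean-pooled frame embeddings compared with cosine distance, with `abx_error_rate` as a hypothetical helper name.

```python
import numpy as np

def cosine_distance(u, v):
    # standard cosine distance between two pooled embedding vectors
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_error_rate(triplets):
    """Fraction of ABX trials judged incorrectly.

    triplets: iterable of (a, b, x) embedding vectors, where x shares
    its category with a and contrasts with b.
    """
    triplets = list(triplets)
    errors = 0
    for a, b, x in triplets:
        # error when x is at least as close to the mismatched token b
        if cosine_distance(x, b) <= cosine_distance(x, a):
            errors += 1
    return errors / len(triplets)
```

With good representations, same-category pairs are reliably closer and the error rate approaches 0; a representation carrying no information about the contrast hovers near chance (0.5).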

Paper Structure

This paper contains 17 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Our prosodic ABX framework.
  • Figure 2: ABX error rates ($\downarrow$) across different prosodic tasks. We compare the best layer of each S3M (orange box plots, $n=17$), acoustic baselines (violet dots), and humans (pink 95% confidence intervals).
  • Figure 3: Word-level ABX error rates for English lexical stress: human participants vs. S3Ms.
  • Figure 4: Error-rate correlation across prosodic tasks. For all S3Ms, each layer is plotted according to its error rates.