Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment

Kwanghee Choi, Eunjung Yeo, Kalvin Chang, Shinji Watanabe, David Mortensen

TL;DR

This work tackles allophony in atypical pronunciation assessment by reframing Goodness of Pronunciation (GoP) as a density estimation problem. It introduces MixGoP, which fits one Gaussian mixture model per phoneme over features from frozen self-supervised speech models (S3Ms), so that each phoneme can have multiple allophonic subclusters. MixGoP scores pronunciations with log-likelihoods rather than softmax posteriors, which better handles out-of-distribution inputs. Across five datasets of dysarthric and non-native speech, MixGoP with S3M features achieves state-of-the-art performance on four, demonstrating the advantage of modeling phoneme distributions instead of relying on unimodal posteriors. The paper further shows that S3M features capture allophony more effectively than MFCCs and Mel spectrograms, and it relates phonetic environment information to downstream pronunciation-scoring performance, with implications for data efficiency and cross-linguistic applicability.

Abstract

Allophony refers to the variation in the phonetic realization of a phoneme based on its phonetic environment. Modeling allophones is crucial for atypical pronunciation assessment, which involves distinguishing atypical from typical pronunciations. However, recent phoneme classifier-based approaches often simplify this by treating various realizations as a single phoneme, bypassing the complexity of modeling allophonic variation. Motivated by the acoustic modeling capabilities of frozen self-supervised speech model (S3M) features, we propose MixGoP, a novel approach that leverages Gaussian mixture models to model phoneme distributions with multiple subclusters. Our experiments show that MixGoP achieves state-of-the-art performance across four out of five datasets, including dysarthric and non-native speech. Our analysis further suggests that S3M features capture allophonic variation more effectively than MFCCs and Mel spectrograms, highlighting the benefits of integrating MixGoP with S3M features.
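
As a rough illustration of scoring with a per-phoneme Gaussian mixture, the sketch below computes a log-likelihood under a hand-specified two-component mixture with diagonal covariances. It is not the paper's implementation: the 2-D toy features, the mixture parameters, and the names `PHONEME_GMM` and `mixgop_like_score` are hypothetical, and the actual MixGoP fits its mixtures to S3M features of typical speech, possibly with a different covariance structure and number of components.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def mixture_log_likelihood(x, weights, means, variances):
    """log p(x | phoneme) = logsumexp_k [log w_k + log N(x; mu_k, var_k)]."""
    terms = [
        math.log(w) + diag_gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)  # subtract the max for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Hypothetical mixture for one phoneme with two allophonic subclusters.
PHONEME_GMM = {
    "weights": [0.5, 0.5],
    "means": [[-2.0, 0.0], [2.0, 0.0]],
    "variances": [[0.25, 0.25], [0.25, 0.25]],
}


def mixgop_like_score(x, gmm=PHONEME_GMM):
    """Score a feature vector by its log-likelihood; no softmax over phonemes."""
    return mixture_log_likelihood(
        x, gmm["weights"], gmm["means"], gmm["variances"]
    )


# Features at either subcluster centre score equally high, while a
# feature far from both subclusters receives a much lower score.
typical_a = mixgop_like_score([-2.0, 0.0])
typical_b = mixgop_like_score([2.0, 0.0])
atypical = mixgop_like_score([0.0, 3.0])
```

Because the score is an unnormalized log-likelihood, an input that is far from every subcluster stays low instead of being renormalized across phonemes, which is the property the abstract associates with handling atypical (out-of-distribution) speech.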

Paper Structure

This paper contains 44 sections, 10 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Summary of our method, MixGoP. We model the likelihood of each phoneme using a Gaussian mixture, trained on typical speech (in-distribution), to capture allophonic variations. We then evaluate on atypical speech (out-of-distribution). The y-axis represents the log-likelihood of a phoneme, where lower values indicate greater atypicality.
  • Figure 2: Visualization of WavLM-Large features of the /AH/ phoneme in the TORGO healthy subset. Phonemes are indicated using ARPABET. We observe that /AH/ consists of subclusters, each reflecting allophones from different surrounding phonetic environments.
  • Figure 3: Normalized Mutual Information $\text{MI}(I; E)/H(E)$ between the k-means cluster indices $I$ and the phonetic environment $E$ on the TORGO healthy subset. We show the layerwise NMI for S3Ms and absolute value for MFCC and Mel spectrogram.
  • Figure 4: Comparing phonetic environment information with the downstream task performance.
  • Figure 5: Kendall-tau correlation coefficient when features are extracted from different layers of S3Ms.
  • ...and 2 more figures
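
The normalized mutual information in the Figure 3 caption, $\text{MI}(I; E)/H(E)$, can be computed from paired labels as in the minimal sketch below. The function names and the toy environment labels (strings encoding a left and right neighbour) are assumptions for illustration; the k-means clustering itself and the paper's exact definition of the phonetic environment are not reproduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(X) in nats, estimated from label counts."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys):
    """MI(X; Y) in nats, estimated from paired label counts."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x, y) * log( p(x, y) / (p(x) p(y)) ), written in terms of counts
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def normalized_mi(cluster_ids, environments):
    """MI(I; E) / H(E): share of environment entropy explained by clusters."""
    h_env = entropy(environments)
    if h_env == 0.0:
        raise ValueError("H(E) is zero: only one environment is present")
    return mutual_information(cluster_ids, environments) / h_env


# Hypothetical environment labels for four tokens of one phoneme.
envs = ["S_T", "S_T", "K_T", "K_T"]
aligned = normalized_mi([0, 0, 1, 1], envs)    # clusters mirror environments
unrelated = normalized_mi([0, 1, 0, 1], envs)  # clusters ignore environments
```

Under this normalization, a value near 1 means the cluster index almost determines the phonetic environment, and a value near 0 means the clusters carry little information about it.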