Segmentation-free Goodness of Pronunciation

Xinwei Cao, Zijian Fan, Torbjørn Svendsen, Giampiero Salvi

TL;DR

The paper tackles phoneme-level mispronunciation detection by removing the dependence on a fixed pre-segmentation of speech. It develops two GOP methods for CTC-based ASR models: self-alignment GOP (GOP-SA) and the more general segmentation-free GOP (GOP-SF), which takes all possible segmentations of the canonical transcription into account and so remains usable despite peaky activations and alignment uncertainty. The authors provide a theoretical account, a normalization scheme, and a numerically stable forward-algorithm implementation, plus feature extraction (FGOP-SF) for downstream MDD models. Experiments on CMU Kids and speechocean762 show that segmentation-free GOP, especially the SD and SDI variants, improves pronunciation assessment, and that feature vectors derived from it achieve state-of-the-art results on phoneme-level assessment on speechocean762. This makes it practical to use modern end-to-end ASR models for phoneme-level MDD in CALL systems without brittle segmentation assumptions.

Abstract

Mispronunciation detection and diagnosis (MDD) is a significant part of modern computer-aided language learning (CALL) systems. Most systems implementing phoneme-level MDD through goodness of pronunciation (GOP), however, rely on pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and the possibility of using modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA), which enables the use of CTC-trained ASR models for MDD. Next, we define a more general segmentation-free method that takes all possible segmentations of the canonical transcription into account (GOP-SF). We give a theoretical account of our definition of GOP-SF, an implementation that solves potential numerical issues, and a proper normalization that allows the use of acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and speechocean762 datasets, comparing the different definitions of our methods and estimating the dependency of GOP-SF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies on the speechocean762 data, showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.
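
To make "takes all possible segmentations of the canonical transcription into account" concrete, the following is a minimal sketch of the standard CTC forward recursion in the log domain, which sums the probability of every frame-level alignment of a label sequence. It is not the paper's GOP-SF definition: the normalization and the modified label graphs are not reproduced here. The names `ctc_log_likelihood` and `logsumexp`, and the convention that index 0 is the blank symbol, are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) for a few log-domain values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_likelihood(log_probs, labels, blank=0):
    """Log-probability of `labels` summed over all CTC alignments.

    log_probs: T frames, each a list of per-symbol log posteriors.
    labels:    label sequence without blanks (assumed non-empty).
    blank:     index of the blank symbol (assumed to be 0 here).
    """
    # Extended sequence with blanks interleaved: b, l1, b, l2, ..., b
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]
    S, T = len(ext), len(log_probs)

    # Initialisation: a path may start in the first blank or the first label.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            terms = [alpha[s]]                      # self-loop
            if s >= 1:
                terms.append(alpha[s - 1])          # advance by one state
            # Skip over a blank only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # A valid path ends in the last label or the trailing blank.
    return logsumexp(alpha[S - 1], alpha[S - 2])


# Two frames, symbols {0: blank, 1: "a"}; alignments of "a" are aa, a-, -a.
frames = [[math.log(0.4), math.log(0.6)], [math.log(0.5), math.log(0.5)]]
toy_ll = ctc_log_likelihood(frames, [1])
```

Working in the log domain avoids the underflow that a product of many small frame posteriors would otherwise cause, which is one kind of numerical issue such an implementation has to handle.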

Paper Structure

This paper contains 25 sections, 23 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Standard GOP methods use forced alignment to segment speech for analysis.
  • Figure 2: Illustration of the issues with standard GOP methods for models trained with CE loss (left) and CTC loss (right). Each plot corresponds to a different kind of mispronunciation and is divided into three parts: "sp" shows the segments that were actually spoken; "fa" shows the forced alignment based on the canonical pronunciation; "ac" shows the activations from the model for the canonical segments. Finally, for the CTC models, the red dashed lines show the alternative forced alignment from the GOP-CTC-align method.
  • Figure 3: The figure exemplifies the computation of GOP-SF (the GOP-SF implementation equation) using a CTC-trained model. In the example, $\mathcal{L}_{\text{CTC}}(L_{\text{C}}, O_1^T)$ can be computed using the graph at the top, whereas $\mathcal{L}_{\text{CTC}}(L_{\text{SDI}}, O_1^T)$ can be computed using the graph at the bottom. For the sake of simplicity we omit the skip connections and self-loop connections.
  • Figure 4: AUC versus context length for simulated and real errors on the CMU Kids data. The shaded area shows 95% confidence intervals computed with the method of Hanley and McNeil (1982).
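
The 95% confidence intervals mentioned for Figure 4 can be illustrated with the Hanley and McNeil (1982) standard-error approximation for an AUC value. This is a sketch of that formula only, not the authors' evaluation code; the function names `hanley_mcneil_se` and `auc_confidence_interval`, the normal-approximation factor `z=1.96`, and the example counts are assumptions.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC estimate (Hanley & McNeil, 1982).

    auc:   area under the ROC curve, in [0, 1].
    n_pos: number of positive examples (e.g. mispronounced phonemes).
    n_neg: number of negative examples (e.g. correctly pronounced phonemes).
    """
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(var)


def auc_confidence_interval(auc, n_pos, n_neg, z=1.96):
    """Normal-approximation interval, clipped to the valid AUC range."""
    se = hanley_mcneil_se(auc, n_pos, n_neg)
    return max(0.0, auc - z * se), min(1.0, auc + z * se)


# Hypothetical example: AUC of 0.8 measured on 100 positives and 100 negatives.
example_se = hanley_mcneil_se(0.8, 100, 100)
example_ci = auc_confidence_interval(0.8, 100, 100)
```

The interval narrows as either class count grows, which is why the shaded band in such a plot is widest where few errors of a given type are available.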