Segmentation-free Goodness of Pronunciation

Xinwei Cao; Zijian Fan; Torbjørn Svendsen; Giampiero Salvi

TL;DR

The paper tackles phoneme-level mispronunciation detection without depending on a fixed segmentation of speech into phonetic units. It develops two GOP frameworks that avoid external pre-segmentation, GOP-SA (self-alignment) and GOP-SF (segmentation-free), both built on CTC-based ASR so that GOP can be evaluated despite peaky activations and alignment uncertainty. The authors provide derivations, normalization schemes, and a forward-algorithm implementation, plus a feature extraction scheme (FGOP-SF) for downstream MDD models. Experiments on CMU Kids and speechocean762 show that segmentation-free GOP, especially the SD and SDI variants, improves pronunciation assessment and reaches performance competitive with the state of the art using simpler methods. These contributions make it possible to combine modern end-to-end ASR with phoneme-level evaluation in CALL systems without brittle segmentation assumptions.

Abstract

Mispronunciation detection and diagnosis (MDD) is an important part of modern computer-aided language learning (CALL) systems. Most systems that implement phoneme-level MDD through goodness of pronunciation (GOP), however, rely on pre-segmentation of speech into phonetic units. This limits the accuracy of these methods and prevents the use of modern CTC-based acoustic models for their evaluation. In this study, we first propose self-alignment GOP (GOP-SA), which enables the use of CTC-trained ASR models for MDD. Next, we define a more general segmentation-free method (GOP-SF) that takes all possible segmentations of the canonical transcription into account. We give a theoretical account of our definition of GOP-SF, an implementation that solves potential numerical issues, and a proper normalization that allows the use of acoustic models with different peakiness over time. We provide extensive experimental results on the CMU Kids and speechocean762 datasets, comparing the different definitions of our methods and estimating the dependency of GOP-SF on the peakiness of the acoustic models and on the amount of context around the target phoneme. Finally, we compare our methods with recent studies on the speechocean762 dataset, showing that the feature vectors derived from the proposed method achieve state-of-the-art results on phoneme-level pronunciation assessment.
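
To make the idea of taking all possible segmentations into account more concrete, the following is a minimal Python sketch of the standard CTC forward algorithm in the log domain, which sums the likelihood of a label sequence over every alignment. The function `illustrative_score` is a hypothetical example and not the paper's GOP-SF definition or normalization: it assumes a simple log-likelihood ratio between the canonical sequence and the best single-phoneme substitution at one position. All names, the blank index of 0, and the input layout (a list of frames, each a list of per-symbol log posteriors) are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the arguments."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_likelihood(log_probs, labels, blank=0):
    """Log-likelihood of `labels`, summed over all CTC alignments.

    log_probs: list of T frames, each a list of per-symbol log posteriors.
    labels: list of symbol indices (no blanks).
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank].
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    n_states = len(ext)

    # Initialisation: a path may start in the first blank or the first label.
    alpha = [NEG_INF] * n_states
    alpha[0] = log_probs[0][ext[0]]
    if n_states > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * n_states
        for s in range(n_states):
            terms = [alpha[s]]                      # stay in the same state
            if s >= 1:
                terms.append(alpha[s - 1])          # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])          # skip the blank in between
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # A valid path ends in the last label or the trailing blank.
    if n_states == 1:
        return alpha[0]
    return logsumexp(alpha[-1], alpha[-2])


def illustrative_score(log_probs, canonical, index, num_symbols, blank=0):
    """Hypothetical score (NOT the paper's definition): canonical log-likelihood
    minus the best log-likelihood when the phoneme at `index` is replaced by
    any non-blank symbol. The result is <= 0, and 0 means the canonical
    phoneme is the best-scoring choice at that position."""
    ll_canonical = ctc_log_likelihood(log_probs, canonical, blank)
    best = NEG_INF
    for p in range(num_symbols):
        if p == blank:
            continue
        alt = canonical[:index] + [p] + canonical[index + 1:]
        best = max(best, ctc_log_likelihood(log_probs, alt, blank))
    return ll_canonical - best
```

Working in the log domain with `logsumexp` is one common way to avoid underflow on long utterances. Because no frame boundaries are ever fixed, the score depends only on the frame-level posteriors and the canonical transcription.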
