Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach

Huu Tuong Tu, Ha Viet Khanh, Tran Tien Dat, Vu Huan, Thien Van Luong, Nguyen Tien Cuong, Nguyen Thi Thu Trang

TL;DR

This work tackles mispronunciation detection and diagnosis (MDD) by removing the need for phoneme-level training. It introduces PER-MDD, a training-free, retrieval-based approach that builds a phoneme embedding pool from a pretrained ASR model and uses cosine similarity to retrieve likely phoneme labels at test time. In systematic ablations on the L2-ARCTIC dataset, the method achieves a false rejection rate (FRR) of 4.43% and an F1 score of 69.60%, outperforming several baselines. The ablations also indicate that retrieval-based MDD works best with mid-frame pooling and larger embedding pools. The approach reduces training complexity while keeping competitive diagnostic detail, which makes it practical for scalable deployment in language learning and speech therapy.
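The retrieval step described above can be illustrated with a minimal sketch. It assumes each phoneme segment has already been reduced to a single embedding vector (the toy 2-D vectors below stand in for pretrained ASR features), and that a retrieved label differing from the canonical phoneme is flagged as a mispronunciation. The names `cosine_similarity`, `retrieve_label`, `diagnose`, the parameter `k`, and the majority-vote rule are hypothetical choices for illustration, not details confirmed by the paper.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_label(query, pool, k=1):
    """Return the majority phoneme label among the k pool entries most similar to query.

    pool is a list of (embedding, label) pairs. Majority voting over the
    top-k neighbours is an assumption made for this sketch.
    """
    ranked = sorted(pool, key=lambda item: cosine_similarity(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def diagnose(segment_embeddings, canonical_phonemes, pool, k=1):
    """Compare retrieved labels with canonical phonemes.

    Returns one (canonical, retrieved, is_mispronounced) tuple per segment.
    """
    results = []
    for embedding, canonical in zip(segment_embeddings, canonical_phonemes):
        retrieved = retrieve_label(embedding, pool, k=k)
        results.append((canonical, retrieved, retrieved != canonical))
    return results


# Toy pool: made-up 2-D vectors standing in for pretrained ASR embeddings.
POOL = [
    ([1.0, 0.0], "AE"),
    ([0.9, 0.1], "AE"),
    ([0.0, 1.0], "IY"),
    ([0.1, 0.9], "IY"),
]

# Two test segments, both expected (canonically) to be "AE".
report = diagnose([[0.8, 0.2], [0.2, 0.8]], ["AE", "AE"], POOL)
```

In this toy run the second segment lies closer to the "IY" entries, so it is flagged and the retrieved label doubles as the diagnosis. No model parameters are updated at any point; the only per-language or per-speaker resource is the pool itself.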

Abstract

Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require scoring models or training phoneme-level models, we propose a novel training-free framework that leverages retrieval techniques with a pretrained Automatic Speech Recognition (ASR) model. Our method avoids phoneme-specific modeling and additional task-specific training, while still achieving accurate detection and diagnosis of pronunciation errors. Experiments on the L2-ARCTIC dataset show that our method achieves an F1 score of 69.60%, outperforming several baselines, while avoiding the complexity of model training.
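This page does not define the reported metrics, so the sketch below assumes the hierarchical definitions commonly used in MDD evaluation: true acceptance (TA), false rejection (FR), false acceptance (FA), and true rejection (TR), with FRR = FR / (TA + FR), precision = TR / (TR + FR), and recall = TR / (TR + FA). The function name `mdd_metrics` and the counts in the example are made up for illustration and are not results from the paper.

```python
def mdd_metrics(ta, fr, fa, tr):
    """Compute FRR, precision, recall, and F1 from MDD confusion counts.

    ta: correct phonemes accepted as correct       (true acceptance)
    fr: correct phonemes flagged as mispronounced  (false rejection)
    fa: mispronounced phonemes accepted as correct (false acceptance)
    tr: mispronounced phonemes flagged as such     (true rejection)

    These definitions are an assumption of this sketch; the source page
    does not spell them out.
    """
    frr = fr / (ta + fr) if (ta + fr) else 0.0
    precision = tr / (tr + fr) if (tr + fr) else 0.0
    recall = tr / (tr + fa) if (tr + fa) else 0.0
    if precision + recall == 0.0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"frr": frr, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to exercise the formulas.
example = mdd_metrics(ta=90, fr=10, fa=20, tr=80)
```

Under these definitions, a low FRR means few correctly pronounced phonemes are wrongly flagged, while F1 balances how many flags are justified against how many true errors are caught.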

Paper Structure

This paper contains 12 sections, 6 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Illustration of our proposed PER-MDD method.