MultiPA: A Multi-task Speech Pronunciation Assessment Model for Open Response Scenarios
Yu-Wen Chen, Zhou Yu, Julia Hirschberg
TL;DR
Open-response pronunciation assessment requires evaluating multiple facets of speech. MultiPA addresses this by fine-tuning a pretrained HuBERT SSL model with dedicated word- and sentence-level heads and by fusing features derived from ASR-based transcripts and alignment. The model produces scores for sentence-level accuracy, fluency, and prosody as well as word-level accuracy, achieving state-of-the-art performance on speechocean762 and generalizing to newly collected out-of-domain MultiPA data. The study includes a real-world pilot data collection with expert annotations and analyzes correlations between tasks to understand the benefits of multi-task learning. One limitation is the reliance on ASR transcripts for word-level scoring, which motivates future work on data augmentation and self-supervised methods. The authors plan to release the pilot data for benchmarking.
Abstract
Pronunciation assessment models designed for open response scenarios enable users to practice language skills in a manner similar to real-life communication. However, previous open-response pronunciation assessment models have predominantly focused on a single pronunciation task, such as sentence-level accuracy, rather than offering a comprehensive assessment across multiple aspects. We propose MultiPA, a Multi-task Pronunciation Assessment model that provides sentence-level accuracy, fluency, prosody, and word-level accuracy assessment for open responses. We examine the correlation between different pronunciation tasks and show the benefits of multi-task learning. Our model reaches state-of-the-art performance on existing in-domain datasets and generalizes effectively to an out-of-domain dataset that we newly collected. The experimental results demonstrate the practical utility of our model in real-world applications.
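To make the task-correlation analysis concrete, the following is a minimal sketch of how pairwise correlations between sentence-level scores could be computed. It assumes the Pearson correlation coefficient as the measure, which the text above does not specify. The function names `pearson` and `correlation_matrix` and all score values are hypothetical and serve only as illustration; they are not data or code from this work.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(task_scores):
    """Pairwise Pearson correlations between all tasks in a dict of score lists."""
    tasks = list(task_scores)
    return {
        (a, b): pearson(task_scores[a], task_scores[b])
        for a in tasks
        for b in tasks
    }


# Hypothetical per-utterance scores (illustrative values only, not real data).
example_scores = {
    "accuracy": [8, 6, 9, 4, 7],
    "fluency": [7, 6, 9, 5, 8],
    "prosody": [7, 5, 9, 5, 8],
}

if __name__ == "__main__":
    for (task_a, task_b), r in correlation_matrix(example_scores).items():
        print(f"{task_a:>8} vs {task_b:<8} r = {r:.3f}")
```

Under this sketch, strongly correlated task pairs are the ones for which a shared representation would be expected to help most, which is the intuition behind training the tasks jointly.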
