MultiPA: A Multi-task Speech Pronunciation Assessment Model for Open Response Scenarios

Yu-Wen Chen, Zhou Yu, Julia Hirschberg

TL;DR

Open-response pronunciation assessment requires evaluating multiple facets of speech. MultiPA addresses this by fine-tuning a pretrained HuBERT self-supervised learning (SSL) model with dedicated word- and sentence-level heads, fusing in features derived from ASR transcripts and alignment. The model produces scores for sentence-level accuracy, fluency, and prosody, as well as word-level accuracy. It achieves state-of-the-art performance on speechocean762 and generalizes to newly collected out-of-domain MultiPA data. The study includes a real-world pilot data collection with expert annotations and analyzes correlations between tasks to understand the benefits of multitask learning. One limitation is the reliance on ASR transcripts for word-level scoring, which motivates future work on data augmentation and self-supervised methods. The authors plan to release the pilot data for benchmarking.

Abstract

Pronunciation assessment models designed for open response scenarios enable users to practice language skills in a manner similar to real-life communication. However, previous open-response pronunciation assessment models have predominantly focused on a single pronunciation task, such as sentence-level accuracy, rather than offering a comprehensive assessment across multiple aspects. We propose MultiPA, a Multitask Pronunciation Assessment model that provides sentence-level accuracy, fluency, prosody, and word-level accuracy assessment for open responses. We examined the correlation between different pronunciation tasks and showed the benefits of multi-task learning. Our model reached state-of-the-art performance on existing in-domain datasets and effectively generalized to an out-of-domain dataset that we newly collected. The experimental results demonstrate the practical utility of our model in real-world applications.

Paper Structure

This paper contains 17 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Overview of MultiPA, where $d$ in Linear and Conv1d layers refers to the output dimension, $k$ is the kernel size, and $h$ indicates the number of heads. The selection of $h$ is based on empirical results.
  • Figure 2: Ablation studies for using different ASR models.
  • Figure 3: Correlation between different pronunciation tasks. A, F, P refer to accuracy, fluency, and prosody, respectively.
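The task correlation analysis that Figure 3 refers to can be illustrated with a minimal sketch. It assumes the correlation is a Pearson coefficient computed over per-utterance scores, which the text here does not confirm. The helper names `pearson` and `correlation_matrix` and the toy scores are hypothetical and are not taken from the paper or its data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(scores):
    """Pairwise Pearson correlations for a dict mapping task name -> scores."""
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}


# Made-up sentence-level scores for five utterances (illustration only).
# A, F, P stand for accuracy, fluency, and prosody, as in Figure 3.
toy_scores = {
    "A": [8, 6, 9, 4, 7],
    "F": [7, 6, 9, 5, 8],
    "P": [7, 5, 9, 5, 8],
}

matrix = correlation_matrix(toy_scores)
for task, row in matrix.items():
    print(task, {other: round(value, 2) for other, value in row.items()})
```

High off-diagonal values in such a matrix would suggest that the tasks share information, which is the kind of evidence that would support training them jointly.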