Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment

Heejin Do, Wonjun Lee, Gary Geunbae Lee

TL;DR

The paper tackles data scarcity and severe label imbalance in multi-aspect pronunciation assessment by introducing Acoustic Feature Mixup (AM) with static ($AM_{stat}$) and dynamic ($AM_{dyn}$) policies that operate on GOP-based acoustic features. By incorporating in-batch mean mixing and non-linear interpolation, along with error-rate features from ASR (CER and MER), the method synthesizes balanced training signals without requiring additional speech data. Evaluations on speechocean762 show significant improvements across phoneme, word, and utterance levels, with dynamic AM plus error-rate features delivering the strongest gains and revealing the importance of mixing direction for balanced learning. The approach offers practical benefits for robust, aspect-balanced pronunciation scoring and hints at resilience to unseen distortions in real-world data.
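The in-batch mean mixing behind the static policy can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each sample is a fixed-length GOP feature vector with a scalar score, that the mixed sample is `lam * x + (1 - lam) * batch_mean`, and that the score label is mixed with the same weight. The names `static_mixup` and `batch_mean` are hypothetical, and the non-linear dynamic policy is not reproduced because its exact form is not given here.

```python
import random
from typing import List, Optional, Tuple


def batch_mean(rows: List[List[float]]) -> List[float]:
    """Column-wise mean of a batch of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def static_mixup(
    features: List[List[float]],
    scores: List[float],
    lam: Optional[float] = None,
    alpha: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Tuple[List[List[float]], List[float], float]:
    """Linearly mix every sample with the in-batch mean feature and score.

    Assumption: lam weights the original sample and (1 - lam) weights the
    batch mean. If lam is None, it is drawn from Beta(alpha, alpha).
    """
    rng = rng or random.Random()
    if lam is None:
        lam = rng.betavariate(alpha, alpha)

    mean_feat = batch_mean(features)
    mean_score = sum(scores) / len(scores)

    mixed_feats = [
        [lam * x + (1.0 - lam) * m for x, m in zip(row, mean_feat)]
        for row in features
    ]
    mixed_scores = [lam * s + (1.0 - lam) * mean_score for s in scores]
    return mixed_feats, mixed_scores, lam


# Toy batch: two samples with two GOP-like feature dimensions each.
feats = [[0.0, 2.0], [2.0, 4.0]]
labels = [0.0, 2.0]
mixed_feats, mixed_labels, used_lam = static_mixup(feats, labels, lam=0.5)
```

Because every sample is pulled toward the batch mean, the mixed labels move toward the batch-average score. This is one way such a scheme could populate sparsely labeled score ranges without extra speech data.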

Abstract

In automated pronunciation assessment, recent emphasis progressively lies on evaluating multiple aspects to provide enriched feedback. However, acquiring multi-aspect-score labeled data for non-native language learners' speech poses challenges; moreover, it often leads to score-imbalanced distributions. In this paper, we propose two Acoustic Feature Mixup strategies, linearly and non-linearly interpolating with the in-batch averaged feature, to address data scarcity and score-label imbalances. Primarily using goodness-of-pronunciation as an acoustic feature, we tailor mixup designs to suit pronunciation assessment. Further, we integrate fine-grained error-rate features by comparing speech recognition results with the original answer phonemes, giving direct hints for mispronunciation. Effective mixing of the acoustic features notably enhances overall scoring performances on the speechocean762 dataset, and detailed analysis highlights our potential to predict unseen distortions.
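As an illustration of the error-rate features, the sketch below compares a recognized phoneme sequence with the canonical one. It assumes that CER and MER follow the usual definitions computed over phoneme tokens: CER = (S + D + I) / N with N the reference length, and MER = (S + D + I) / (H + S + D + I), where H, S, D, and I count hits, substitutions, deletions, and insertions in a Levenshtein alignment. The function names are hypothetical, and the paper's exact feature extraction may differ.

```python
from typing import List, Sequence, Tuple


def align_counts(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int, int]:
    """Return (hits, substitutions, deletions, insertions) from a Levenshtein alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # hit or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back from the bottom-right corner to count each operation type.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                if cost == 0:
                    hits += 1
                else:
                    subs += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def error_rates(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[float, float]:
    """Return (CER, MER) under the assumed definitions; 0.0 when a denominator is zero."""
    hits, subs, dels, ins = align_counts(ref, hyp)
    errors = subs + dels + ins
    cer = errors / len(ref) if len(ref) > 0 else 0.0
    total = hits + errors
    mer = errors / total if total > 0 else 0.0
    return cer, mer


# Canonical phonemes for "cat" versus a hypothetical ASR output.
canonical = ["K", "AE", "T"]
recognized = ["K", "AH", "T", "S"]
cer, mer = error_rates(canonical, recognized)
```

In this toy case the alignment has two hits, one substitution, and one insertion, so CER is 2/3 and MER is 1/2. A higher value on either rate gives the scorer a more direct hint of mispronunciation than the acoustic features alone.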

Paper Structure

This paper contains 16 sections, 4 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: An example of how the GOP features, log phone posterior (LPP) and log posterior ratio (LPR), shift after dynamic Mixup is applied.
  • Figure 2: The utterance-level score-label distribution shift when (a) $AM_{stat}$ with fixed $\lambda = 0.3$, (b) $AM_{stat}$ with $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$, and (c) $AM_{dyn}$ are applied. Blue and pink bars denote the original and mixed-up distributions, respectively.
  • Figure 3: Score-label distribution shift when $AM_{dyn}$ is applied with the original and the reverse directions (left), and PCC performance and standard deviation of PCC of aspects within each granularity level (right).