Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR
Ling Sun, Charlotte Zhu, Shuju Shi
TL;DR
This work addresses the inequity of general-purpose ASR for L2 speakers by treating proficiency as a core latent variable and evaluating on the CEFR-graded Speak & Improve corpus. It shows that naive fine-tuning of Whisper lowers average WER but widens disparities across proficiency levels. To counter this, the authors introduce proficiency-aware multitask learning, which jointly optimizes ASR with proficiency classification, and targeted spectrogram augmentation applied to low-proficiency speech. These strategies reduce WER by up to 29.4% (relative) and insertion/deletion errors by as much as 58.6% (relative), and they consistently narrow the gaps between proficiency levels. The findings highlight proficiency awareness as essential for fair and effective L2 ASR, with practical implications for education and accessibility, and point to future work on improving proficiency classification and accent-robust modeling.
Abstract
General-purpose ASR underperforms for atypical speakers, such as L2 learners, reinforcing bias and limiting its use in education and accessibility. Using the CEFR-graded Speak & Improve corpus, we show that naive fine-tuning of Whisper reduces average WER but widens disparities and disproportionately harms lower-level learners. To address this, we propose two strategies: (i) proficiency-aware multitask learning, which jointly optimizes ASR with proficiency classification, and (ii) targeted augmentation, which applies spectrogram masking to low-proficiency speech to counter imbalance. These approaches reduce WER by up to 29.4% (relative) and insertion/deletion errors by as much as 58.6% (relative). Crucially, although the dataset is severely imbalanced, reflecting real-world distributions, both strategies consistently narrow proficiency gaps, advancing equitable ASR for L2 learners.
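To make strategy (ii) concrete, the following is a minimal sketch of targeted spectrogram masking, not the authors' implementation. It assumes a spectrogram stored as a list of frequency rows, SpecAugment-style masking with one frequency band and one time span per copy, and a hypothetical set of CEFR labels treated as "low proficiency". The names `LOW_PROFICIENCY`, `mask_spectrogram`, and `augment_targeted`, along with the mask widths and copy count, are illustrative assumptions; the abstract does not specify them.

```python
import random

# Assumption: which CEFR levels count as "low proficiency" is illustrative only.
LOW_PROFICIENCY = {"A2", "B1"}


def mask_spectrogram(spec, max_freq_width=8, max_time_width=20, rng=None):
    """Return a copy of `spec` with one frequency band and one time span zeroed.

    `spec` is a list of frequency rows, each a list of per-frame values.
    The mask widths are hypothetical defaults, not values from the paper.
    """
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is left untouched

    # Frequency mask: zero a band of consecutive frequency bins.
    f_width = rng.randint(0, min(max_freq_width, n_freq))
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        out[f] = [0.0] * n_time

    # Time mask: zero a span of consecutive frames in every row.
    t_width = rng.randint(0, min(max_time_width, n_time))
    t_start = rng.randint(0, n_time - t_width)
    for row in out:
        row[t_start:t_start + t_width] = [0.0] * t_width
    return out


def augment_targeted(examples, copies=1, seed=0):
    """Keep every example; add masked copies only for low-proficiency speech.

    `examples` is a list of (spectrogram, cefr_label) pairs.
    """
    rng = random.Random(seed)
    augmented = []
    for spec, cefr in examples:
        augmented.append((spec, cefr))
        if cefr in LOW_PROFICIENCY:
            for _ in range(copies):
                augmented.append((mask_spectrogram(spec, rng=rng), cefr))
    return augmented
```

Because masked copies are added only for the under-represented levels, the training set grows only where data is scarce, which is one way to read "counter imbalance" in the abstract. A real pipeline would more likely apply the masks to log-mel tensors on the fly during training.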
