Proficiency-Aware Adaptation and Data Augmentation for Robust L2 ASR

Ling Sun, Charlotte Zhu, Shuju Shi

TL;DR

This work addresses the inequity of general-purpose ASR for L2 speakers by treating proficiency as a core variable and evaluating on the CEFR-graded Speak & Improve corpus. It shows that naive fine-tuning can lower average WER while widening disparities across proficiency levels. To counter this, the authors introduce proficiency-aware multitask learning and targeted spectrogram augmentation, which reduce WER and, crucially, narrow the gap for low-proficiency learners. The approaches achieve up to a 29.4% relative WER reduction and up to a 58.6% relative reduction in insertion/deletion errors, advancing equitable ASR for L2 learners with practical implications for education and accessibility. These findings highlight proficiency awareness as essential for fair and effective L2 ASR and point to future work on proficiency classification and accent-robust modeling.

Abstract

General-purpose ASR underperforms for atypical speakers, such as L2 learners, reinforcing bias and limiting use in education and accessibility. Using the CEFR-graded Speak and Improve corpus, we show that naive fine-tuning of Whisper reduces average WER but simultaneously widens disparities and disproportionately harms lower-level learners. To address this, we propose two strategies: (i) proficiency-aware multitask learning, jointly optimizing ASR with proficiency classification, and (ii) targeted augmentation, applying spectrogram masking to low-proficiency speech to counter imbalance. These approaches reduce WER by up to 29.4 percent (relative) and insertion/deletion errors by as much as 58.6 percent (relative). Crucially, despite the severe imbalance of the dataset reflecting real-world distributions, both strategies consistently narrow proficiency gaps, advancing equitable ASR for L2 learners.
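
The targeted augmentation described in the abstract can be illustrated with a minimal sketch: mask a random frequency band and a random time span of a spectrogram, but only for utterances labeled as low proficiency. This is not the authors' implementation. The set of levels treated as "low" (`LOW_PROFICIENCY`), the mask widths, the function names, and the list-of-lists spectrogram layout are all assumptions made for illustration.

```python
import random

# Assumption: the abstract does not say which CEFR levels count as "low".
LOW_PROFICIENCY = {"A2", "B1"}


def mask_spectrogram(spec, rng, max_freq_width=8, max_time_width=20):
    """Return a copy of spec (mel bins x frames) with one random frequency
    band and one random time span set to zero."""
    n_mels, n_frames = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the original stays untouched

    # Width and start of the frequency mask (width may be 0, i.e. no mask).
    f = rng.randint(0, min(max_freq_width, n_mels))
    f0 = rng.randint(0, n_mels - f)
    # Width and start of the time mask.
    t = rng.randint(0, min(max_time_width, n_frames))
    t0 = rng.randint(0, n_frames - t)

    for i in range(n_mels):
        for j in range(n_frames):
            if f0 <= i < f0 + f or t0 <= j < t0 + t:
                out[i][j] = 0.0
    return out


def augment_targeted(examples, rng):
    """Keep every (spectrogram, transcript, cefr_level) example and append
    one masked copy for each low-proficiency example."""
    augmented = list(examples)
    for spec, text, level in examples:
        if level in LOW_PROFICIENCY:
            augmented.append((mask_spectrogram(spec, rng), text, level))
    return augmented


if __name__ == "__main__":
    rng = random.Random(0)
    toy = [[1.0] * 100 for _ in range(80)]  # toy 80 x 100 spectrogram
    data = [(toy, "hello there", "A2"), (toy, "good morning", "C1")]
    print(len(augment_targeted(data, rng)))  # one extra copy for the A2 item
```

Because only low-proficiency utterances gain extra copies, the augmented training set contains relatively more low-level speech, which is one plausible way to counter the imbalance the abstract mentions.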

Paper Structure

This paper contains 12 sections, 1 equation, 3 figures, 1 table.

Figures (3)

  • Figure 1: Word distribution in the S&I corpus by label category. A2–C1 indicate CEFR proficiency levels (A2 = low, C1 = high). Q3–Q5 denote audio quality (Q3 = low, Q5 = high).
  • Figure 2: The pipelines used in this study, including the Whisper baseline and three mitigation strategies.
  • Figure 3: WER breakdown by error type on the development and evaluation sets for five systems: baseline Whisper-small, LoRA fine-tuning, LoRA with auxiliary proficiency head, LoRA with data augmentation, and LoRA with both multitask learning and augmentation.
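
A WER breakdown by error type, as in Figure 3, separates substitutions, deletions, and insertions along a minimum-edit alignment between reference and hypothesis. The sketch below shows one common way to compute it, together with the relative reduction used for figures such as "29.4% relative". It is not the authors' evaluation code: the function names, whitespace tokenization, and tie-breaking order are assumptions, and the numbers in the usage lines are hypothetical.

```python
def wer_breakdown(reference, hypothesis):
    """Word-level WER split into substitution, deletion, and insertion rates."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Each cell holds (edits, subs, dels, ins) for ref[:i] vs hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]  # only insertions
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]  # only deletions
        for j in range(1, len(hyp) + 1):
            e, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (e, s, d, n)          # match, no cost
            else:
                best = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = prev[j]
            if e + 1 < best[0]:
                best = (e + 1, s, d + 1, n)  # deletion (reference word dropped)
            e, s, d, n = cur[j - 1]
            if e + 1 < best[0]:
                best = (e + 1, s, d, n + 1)  # insertion (extra hypothesis word)
            cur.append(best)
        prev = cur

    edits, subs, dels, ins = prev[-1]
    n_ref = len(ref)
    return {
        "wer": edits / n_ref,
        "sub": subs / n_ref,
        "del": dels / n_ref,
        "ins": ins / n_ref,
    }


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    print(wer_breakdown("the cat sat on the mat", "the cat sat on mat"))
    # Hypothetical error rates, not results from the paper.
    print(relative_reduction(0.20, 0.15))
```

Note that a relative reduction is measured against the baseline error rate, so it is larger than the corresponding absolute difference in WER points whenever the baseline is below 100%.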