Data-Efficient ASR Personalization for Non-Normative Speech Using an Uncertainty-Based Phoneme Difficulty Score for Guided Sampling

Niclas Pokel, Pehuén Moure, Roman Böhringer, Yingqiang Gao

Abstract

ASR systems struggle with non-normative speech due to high acoustic variability and data scarcity. We propose a data-efficient method using phoneme-level uncertainty to guide fine-tuning for personalization. Instead of computationally expensive ensembles, we leverage Variational Low-Rank Adaptation (VI LoRA) to estimate epistemic uncertainty in foundation models. These estimates form a composite Phoneme Difficulty Score (PhDScore) that drives a targeted oversampling strategy. Evaluated on English and German datasets, including a longitudinal analysis against two clinical reports taken one year apart, we demonstrate that: (1) VI LoRA-based uncertainty aligns better with expert clinical assessments than standard entropy; (2) PhDScore captures stable, persistent articulatory difficulties; and (3) uncertainty-guided sampling significantly improves ASR accuracy for impaired speech.
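
The targeted oversampling idea can be illustrated with a minimal sketch. The abstract does not give the PhDScore formula or the exact sampling rule, so everything below is an assumption made for illustration: the per-phoneme scores are toy values, an utterance's weight is taken as the mean score of its phonemes, and the names `utterance_weight`, `oversample`, and `floor` are hypothetical.

```python
import random
from collections import Counter


def utterance_weight(phonemes, difficulty, floor=0.05):
    """Mean per-phoneme difficulty of an utterance (assumed aggregation).

    The floor keeps easy utterances sampleable instead of dropping them.
    """
    if not phonemes:
        return floor
    mean = sum(difficulty.get(p, 0.0) for p in phonemes) / len(phonemes)
    return max(mean, floor)


def oversample(utterances, difficulty, k, seed=0):
    """Draw k utterance ids with replacement, proportional to their weight."""
    rng = random.Random(seed)
    ids = list(utterances)
    weights = [utterance_weight(utterances[u], difficulty) for u in ids]
    return rng.choices(ids, weights=weights, k=k)


# Toy difficulty scores in [0, 1]; real values would come from the model.
difficulty = {"s": 0.9, "r": 0.7, "a": 0.1, "m": 0.05}
utterances = {
    "utt_easy": ["m", "a", "m", "a"],
    "utt_hard": ["s", "r", "a", "s"],
}

draws = oversample(utterances, difficulty, k=1000)
counts = Counter(draws)
print(counts)  # utterances rich in difficult phonemes are drawn more often
```

Under these assumptions, the fine-tuning set built from `draws` contains proportionally more utterances with high-difficulty phonemes, while the floor prevents easy utterances from disappearing entirely.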

Paper Structure

This paper contains 13 sections, 4 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Comparison of LoRA and full fine-tuning (FT) with (w) and without (w/o) oversampling (OS). Negative values (blue) indicate an improvement on BF-Sprache, while positive values (red) show forgetting of normative speech.
  • Figure 2: Longitudinal Clinical Validation. Top Row: Precision-Recall curves for Pre-trained models against Assessment 1 (Left) and Assessment 2 (Right). Solid lines (PhDScore) consistently outperform dotted lines (Entropy), with VI LoRA (Red) achieving the highest alignment (AP=0.82). Bottom Left: Fine-tuning collapses the correlation, indicating uncertainty resolution. Bottom Right: Summary of AP scores, highlighting the superiority of PhDScore over Entropy and the effect of fine-tuning.
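
For readers unfamiliar with the AP values quoted in Figure 2, the following is a minimal sketch of average precision in its common step-wise (non-interpolated) form. It assumes binary labels (1 for a phoneme flagged as difficult in a clinical assessment, 0 otherwise) and one difficulty score per phoneme; whether the paper uses exactly this definition is an assumption, and the function name and toy data are hypothetical.

```python
def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank items from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Toy example: two of four phonemes are clinically flagged.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)
print(round(ap, 4))  # (1/1 + 2/3) / 2 = 0.8333
```

A score that ranks every flagged phoneme above every unflagged one yields AP = 1.0, so higher AP means closer agreement between the score and the assessment.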