The Universal Personalizer: Few-Shot Dysarthric Speech Recognition via Meta-Learning
Dhruuv Agarwal, Harry Zhang, Yang Yu, Quan Wang
TL;DR
The Universal Personalizer addresses a practical obstacle in dysarthric ASR: personalization normally requires burdensome enrollment collection and per-user training. It reframes personalization as in-context learning with a single, fixed model. Using a mixed 0-shot/10-shot meta-training regime on a Gemini 2.5 Flash base, the approach supports both zero-shot transcription and few-shot transcription with support sets supplied in context. It improves WER on large dysarthric datasets: 13.9% on Euphonia versus 17.5% for speaker-independent baselines, and 5.3% on SAP Test-1 versus 5.97% for the challenge-winning team. Key contributions include state-of-the-art results on Euphonia and SAP Test-1, a thorough analysis of enrollment curation versus dynamic retrieval, and data ablations on speaker versus domain adaptation. The work demonstrates that instant, high-quality personalization is feasible at scale, lowering the barrier for millions of dysarthric speakers, and it points to dynamic acoustic retrieval as the direction for closing the remaining gaps.
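As a rough illustration of what a mixed 0-shot/10-shot training episode could look like, the sketch below samples a query utterance and then either no support examples or up to ten same-speaker support examples. This is not the authors' pipeline: the function name `build_episode`, the dictionary fields, and the uniform choice between 0 and 10 shots are all assumptions made for illustration, since the summary does not state the mixing ratio or the data format.

```python
import random
from collections import defaultdict


def build_episode(utterances, rng, k_choices=(0, 10)):
    """Sample one training episode (hypothetical helper, not from the paper).

    utterances: list of dicts with 'speaker', 'audio', and 'text' keys
                (field names are assumptions).
    rng:        a random.Random instance, for reproducibility.
    k_choices:  shot counts to mix; uniform sampling is an assumption.
    """
    # Group utterances by speaker so support examples match the query speaker.
    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt["speaker"]].append(utt)

    query = rng.choice(utterances)
    k = rng.choice(k_choices)

    # Exclude the query itself so its transcript never leaks into the context.
    pool = [u for u in by_speaker[query["speaker"]] if u is not query]
    support = rng.sample(pool, min(k, len(pool)))

    return {"support": support, "query": query}


if __name__ == "__main__":
    data = [
        {"speaker": s, "audio": f"{s}_{i}.wav", "text": f"phrase {i}"}
        for s in ("spk_a", "spk_b")
        for i in range(12)
    ]
    episode = build_episode(data, random.Random(0))
    print(len(episode["support"]), episode["query"]["speaker"])
```

At inference time the same structure would apply, with the support set drawn from a user's enrollment recordings instead of training data.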
Abstract
Personalizing dysarthric ASR is hindered by demanding enrollment collection and per-user training. We propose a hybrid meta-training method for a single model, enabling zero-shot and few-shot on-the-fly personalization via in-context learning (ICL). On Euphonia, it achieves 13.9% Word Error Rate (WER), surpassing speaker-independent baselines (17.5%). On SAP Test-1, our 5.3% WER outperforms the challenge-winning team (5.97%). On Test-2, our 9.49% WER trails only the winner (8.11%), and does so without relying on techniques such as offline model merging or custom audio chunking. Curation yields a 40% WER reduction using random same-speaker examples, validating active personalization. While static text curation fails to beat this random same-speaker baseline, oracle similarity reveals substantial headroom, highlighting dynamic acoustic retrieval as the next frontier. Data ablations confirm rapid low-resource speaker adaptation, establishing the model as a practical personalized solution.
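For readers who want to relate the WER figures above to one another, the following is a minimal sketch of corpus-level WER (word-level edit distance divided by the number of reference words) and of relative WER reduction. It assumes whitespace tokenization and no text normalization; the paper's actual scoring setup is not described here, and the function names are illustrative.

```python
def word_errors(ref, hyp):
    """Return (edit_distance, reference_length) at the word level."""
    r, h = ref.split(), hyp.split()
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (rw != hw),  # substitution or match
            ))
        prev = cur
    return prev[-1], len(r)


def wer(refs, hyps):
    """Corpus-level WER in percent: total errors over total reference words."""
    errors = words = 0
    for ref, hyp in zip(refs, hyps):
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return 100.0 * errors / words


def relative_reduction(baseline, new):
    """Relative reduction in percent from a baseline WER to a new WER."""
    return 100.0 * (baseline - new) / baseline


if __name__ == "__main__":
    print(wer(["a b c d"], ["a x c"]))                # one substitution, one deletion
    print(round(relative_reduction(17.5, 13.9), 1))   # Euphonia figures from the abstract
    print(round(relative_reduction(5.97, 5.3), 1))    # SAP Test-1 figures from the abstract
```

Under this definition, going from 17.5% to 13.9% on Euphonia is a relative reduction of about 20.6%, and going from 5.97% to 5.3% on SAP Test-1 is about 11.2%.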
