The Universal Personalizer: Few-Shot Dysarthric Speech Recognition via Meta-Learning

Dhruuv Agarwal, Harry Zhang, Yang Yu, Quan Wang

TL;DR

The Universal Personalizer tackles the practical challenge of personalizing dysarthric ASR without burdensome per-user training by reframing personalization as in-context learning with a single, fixed model. A mixed 0-shot/10-shot meta-training regime on a Gemini 2.5 Flash base enables zero-shot and few-shot transcription from support sets supplied in the context, reaching 13.9% WER on Euphonia and 5.3% WER on SAP Test-1. Key contributions include state-of-the-art results on Euphonia and SAP Test-1, a thorough analysis of enrollment curation versus dynamic retrieval, and data-efficiency insights into speaker versus domain adaptation. The work demonstrates that instant, high-quality personalization is feasible at scale, lowering the barrier for millions of dysarthric speakers and pointing to dynamic acoustic retrieval as a future direction for closing the remaining gaps.
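To make the mixed 0-shot/10-shot idea concrete, the following is a minimal sketch of how a training episode might be assembled: each query is paired either with an empty support set or with up to ten random same-speaker examples. The `Utterance` type, the `build_episode` helper, and the 50/50 mixing probability are illustrative assumptions, not details taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """Hypothetical record for one enrolled recording."""
    speaker_id: str
    audio_path: str
    transcript: str


def build_episode(query, pool, rng, k_shot=10, p_few_shot=0.5):
    """Return (support, query) for one meta-training episode.

    With probability p_few_shot the support set holds up to k_shot random
    utterances from the query's speaker; otherwise it is empty (0-shot).
    The value of p_few_shot is an assumption, not a figure from the paper.
    """
    if rng.random() >= p_few_shot:
        return [], query  # zero-shot episode: no in-context examples
    # Same-speaker candidates, excluding the query itself to avoid leakage.
    candidates = [
        u for u in pool
        if u.speaker_id == query.speaker_id and u != query
    ]
    support = rng.sample(candidates, min(k_shot, len(candidates)))
    return support, query


if __name__ == "__main__":
    rng = random.Random(0)
    pool = [Utterance("spk_a", f"a_{i}.wav", f"phrase {i}") for i in range(12)]
    support, query = build_episode(pool[0], pool, rng, p_few_shot=1.0)
    print(len(support), "support examples for", query.audio_path)
```

Training on both episode types is what would let one fixed model serve users with and without enrollment recordings; at inference time the same function shape applies, with the support set drawn from whatever examples the user has provided.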

Abstract

Personalizing dysarthric ASR is hindered by demanding enrollment collection and per-user training. We propose a hybrid meta-training method for a single model, enabling zero-shot and few-shot on-the-fly personalization via in-context learning (ICL). On Euphonia, it achieves 13.9% Word Error Rate (WER), surpassing speaker-independent baselines (17.5%). On SAP Test-1, our 5.3% WER outperforms the challenge-winning team (5.97%). On Test-2, our 9.49% WER trails only the winner (8.11%), and does so without relying on techniques such as offline model merging or custom audio chunking. Curation yields a 40% WER reduction using random same-speaker examples, validating active personalization. While static text curation fails to beat this baseline, oracle similarity reveals substantial headroom, highlighting dynamic acoustic retrieval as the next frontier. Data ablations confirm rapid low-resource speaker adaptation, establishing the model as a practical personalized solution.
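For readers less familiar with the metric, the sketch below computes WER as word-level edit distance divided by reference length, and then expresses the reported absolute figures as relative reductions. It uses plain whitespace tokenization, which is an assumption; the paper's exact text normalization and scoring tool are not stated here, and the function names are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction of the baseline error removed by the system."""
    return (baseline_wer - system_wer) / baseline_wer


if __name__ == "__main__":
    # One substitution and one deletion over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Euphonia: 17.5% -> 13.9% is roughly a 20.6% relative reduction.
    print(f"{relative_reduction(17.5, 13.9):.1%}")
    # SAP Test-1: 5.97% -> 5.3% is roughly an 11.2% relative reduction.
    print(f"{relative_reduction(5.97, 5.3):.1%}")
```

The relative figures are simple arithmetic on the numbers quoted in the abstract; they are included only to put the absolute WER gaps on a common scale.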

Paper Structure

This paper contains 22 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Overview. Left to Right: Architecture; Example construction; Mixed MetaICL training & evaluation with example curation.