persoDA: Personalized Data Augmentation for Personalized ASR
Pablo Peso Parada, Spyros Fontalis, Md Asif Jalal, Karthikeyan Saravanan, Anastasios Drosou, Mete Ozay, Gil Ho Lee, Jungin Lee, Seokyeong Jung
TL;DR
persoDA tackles the mismatch between training and test conditions in on-device ASR by learning augmentation parameters $\theta$ from a user’s data to mirror their acoustic environment $E$. It decomposes into two personalized augmentation streams: persoNoise for user-specific background noise and persoReverb for room reverberation, each derived from unlabeled user recordings using VAD, speech separation, and T60-based RIR selection. Empirical results on LibriSpeech and VOiCES show that persoDA yields a relative WER reduction of about 13.9% over standard data augmentation and accelerates convergence by 16–20% compared with MCT. The approach requires minimal storage (a few seconds of RIRs) and demonstrates practical benefits for fast, on-device personalization with limited user data.
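The two augmentation streams can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`select_rir_by_t60`, `add_noise_at_snr`, `perso_augment`), the `(t60_seconds, rir)` bank layout, and the `snr_db` parameter are all hypothetical. It also assumes the user's noise segment and T60 estimate have already been obtained upstream (for example via VAD, speech separation, and a T60 estimator), which are not shown here.

```python
import numpy as np

def select_rir_by_t60(rir_bank, user_t60):
    """Return the RIR whose T60 is closest to the user's estimated T60.

    rir_bank is an assumed list of (t60_seconds, rir_array) pairs.
    """
    return min(rir_bank, key=lambda entry: abs(entry[0] - user_t60))[1]

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech at the requested SNR in dB.

    The noise is tiled and cropped so it matches the speech length.
    """
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

def perso_augment(speech, user_noise, rir_bank, user_t60, snr_db):
    """Apply user-matched reverberation, then user-matched noise."""
    rir = select_rir_by_t60(rir_bank, user_t60)
    # Convolve with the selected RIR and crop back to the input length.
    reverberant = np.convolve(speech, rir)[: len(speech)]
    return add_noise_at_snr(reverberant, user_noise, snr_db)

# Usage with synthetic signals (placeholders, not real recordings).
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
user_noise = rng.standard_normal(4000)
rir_bank = [
    (t60, rng.standard_normal(800) * np.exp(-np.arange(800) / (t60 * 200)))
    for t60 in (0.2, 0.5, 0.9)
]
augmented = perso_augment(speech, user_noise, rir_bank, user_t60=0.45, snr_db=10.0)
```

In this sketch the contrast with MCT would lie only in how the noise and RIR are chosen: MCT draws them at random, whereas here both are conditioned on quantities estimated from the user's recordings.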
Abstract
Data augmentation (DA) is ubiquitously used in training Automatic Speech Recognition (ASR) models. DA offers increased data variability, robustness, and generalization against different acoustic distortions. Recently, personalization of ASR models on mobile devices has been shown to improve Word Error Rate (WER). This paper evaluates data augmentation in this context and proposes persoDA, a DA method that is driven by the user's data and used to personalize ASR. persoDA aims to augment training with data specifically tuned to the acoustic characteristics of the end user, as opposed to standard augmentation based on Multi-Condition Training (MCT), which applies random reverberation and noise. Our evaluation with a Conformer-based ASR baseline trained on LibriSpeech and personalized for VOiCES shows that persoDA achieves a 13.9% relative WER reduction over standard data augmentation (random noise and reverberation). Furthermore, persoDA converges 16% to 20% faster than MCT.
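For readers unfamiliar with the metric, relative WER reduction is conventionally measured against the baseline's WER. The subscripts below are notation introduced here for illustration, not symbols from the paper:

$$
\text{relative WER reduction} = \frac{\mathrm{WER}_{\text{standard DA}} - \mathrm{WER}_{\text{persoDA}}}{\mathrm{WER}_{\text{standard DA}}}
$$

Under this reading, a 13.9% relative reduction corresponds to $\mathrm{WER}_{\text{persoDA}} \approx 0.861 \cdot \mathrm{WER}_{\text{standard DA}}$.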
