persoDA: Personalized Data Augmentation for Personalized ASR

Pablo Peso Parada, Spyros Fontalis, Md Asif Jalal, Karthikeyan Saravanan, Anastasios Drosou, Mete Ozay, Gil Ho Lee, Jungin Lee, Seokyeong Jung

TL;DR

persoDA tackles the mismatch between training and test conditions in on-device ASR by learning augmentation parameters $\theta$ from a user’s data to mirror their acoustic environment $E$. It decomposes into two personalized augmentation streams: persoNoise for user-specific background noise and persoReverb for room reverberation, each derived from unlabeled user recordings using VAD, speech separation, and T60-based RIR selection. Empirical results on LibriSpeech and VOiCES show that persoDA yields a relative WER reduction of about 13.9% over standard data augmentation and accelerates convergence by 16–20% compared with MCT. The approach requires minimal storage (a few seconds of RIRs) and demonstrates practical benefits for fast, on-device personalization with limited user data.
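To make the T60-based RIR selection step more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a target T60 has already been estimated from the user's recordings (how that is done is not described here), and it uses Schroeder backward integration with a -5 dB to -25 dB line fit as one common way to measure the T60 of each candidate RIR. The names `estimate_t60`, `select_rir`, and `make_synthetic_rir` are hypothetical.

```python
import math


def make_synthetic_rir(t60, sample_rate, duration=1.0):
    """Toy RIR (assumption): a pure exponential decay with a known T60."""
    # Amplitude must fall by 60 dB (a factor of 1000) after t60 seconds.
    k = math.log(1000.0) / t60
    n = int(duration * sample_rate)
    return [math.exp(-k * i / sample_rate) for i in range(n)]


def estimate_t60(rir, sample_rate):
    """Estimate T60 in seconds from an RIR via Schroeder backward integration."""
    # Energy decay curve: energy remaining from sample i to the end.
    edc = [0.0] * len(rir)
    running = 0.0
    for i in range(len(rir) - 1, -1, -1):
        running += rir[i] * rir[i]
        edc[i] = running
    total = edc[0]
    if total <= 0.0:
        raise ValueError("RIR has no energy")

    # Collect the part of the decay curve between -5 dB and -25 dB.
    points = []
    for i, e in enumerate(edc):
        if e <= 0.0:
            break
        db = 10.0 * math.log10(e / total)
        if -25.0 <= db <= -5.0:
            points.append((i / sample_rate, db))
    if len(points) < 2:
        raise ValueError("RIR too short to fit a decay slope")

    # Least-squares slope (dB per second), extrapolated to a 60 dB decay.
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_db = sum(d for _, d in points) / n
    num = sum((t - mean_t) * (d - mean_db) for t, d in points)
    den = sum((t - mean_t) ** 2 for t, _ in points)
    return -60.0 / (num / den)


def select_rir(rir_bank, target_t60, sample_rate):
    """Return the RIR whose estimated T60 is closest to the target."""
    return min(
        rir_bank,
        key=lambda rir: abs(estimate_t60(rir, sample_rate) - target_t60),
    )


if __name__ == "__main__":
    fs = 8000
    bank = [make_synthetic_rir(t, fs) for t in (0.2, 0.4, 0.8)]
    chosen = select_rir(bank, target_t60=0.45, sample_rate=fs)
    print(round(estimate_t60(chosen, fs), 2))
```

Measured RIRs are noisier than this toy decay, so a real pipeline would likely need noise-floor handling before the line fit.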

Abstract

Data augmentation (DA) is ubiquitous in training Automatic Speech Recognition (ASR) models. DA increases data variability and improves robustness and generalization against different acoustic distortions. Recently, personalization of ASR models on mobile devices has been shown to improve Word Error Rate (WER). This paper evaluates data augmentation in this context and proposes persoDA, a DA method driven by the user's data and used to personalize ASR. persoDA aims to augment training with data specifically tuned to the acoustic characteristics of the end user, as opposed to standard augmentation based on Multi-Condition Training (MCT), which applies random reverberation and noise. Our evaluation with a Conformer-based ASR baseline trained on LibriSpeech and personalized for VOiCES shows that persoDA achieves a 13.9% relative WER reduction over standard data augmentation (random noise and reverberation). Furthermore, persoDA converges 16% to 20% faster than MCT.
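As an illustration of what "applying reverberation and noise" means in practice, the sketch below convolves an utterance with an RIR and mixes in noise at a chosen signal-to-noise ratio. It is a generic augmentation routine under stated assumptions, not the paper's code: the function names are hypothetical, and the only difference between an MCT-style and a personalized setup in this sketch is where `rir` and `noise` come from (a random generic pool versus material derived from the user's recordings).

```python
import math


def convolve(signal, rir):
    """Direct-form linear convolution (fine for short illustrative signals)."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def power(x):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech so the result has the requested SNR in dB."""
    # Tile the noise if it is shorter than the speech, then crop to length.
    reps = len(speech) // len(noise) + 1
    noise = (noise * reps)[: len(speech)]
    noise_power = power(noise)
    if noise_power == 0.0:
        raise ValueError("noise segment is silent")
    # Choose the gain so that power(speech) / power(gain * noise) = 10^(snr/10).
    gain = math.sqrt(power(speech) / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def augment(speech, rir, noise, snr_db):
    """Reverberate the speech with an RIR, then add noise at the given SNR."""
    reverberant = convolve(speech, rir)[: len(speech)]
    return add_noise_at_snr(reverberant, noise, snr_db)


if __name__ == "__main__":
    speech = [math.sin(0.1 * i) for i in range(200)]  # stand-in for an utterance
    rir = [1.0, 0.0, 0.5]                             # toy two-tap room response
    noise = [1.0, -1.0] * 10                          # toy noise segment
    augmented = augment(speech, rir, noise, snr_db=10.0)
    print(len(augmented))
```

For real audio, the direct convolution would normally be replaced by an FFT-based routine, since the nested loop scales with the product of the two signal lengths.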

Paper Structure (14 sections, 1 equation, 3 figures, 2 tables)

Figures (3)

  • Figure 1: The personalized data augmentation (persoDA) framework. persoDA guides the training DA process to select the most suitable augmentation given the user's data.
  • Figure 2: WER achieved on the set $\mathcal{V}'$ used to estimate information to guide persoDA. Experiments 'w/ labels' were trained with ground-truth transcripts and 'w/o labels' were trained with pseudo-labels.
  • Figure 3: WER achieved on the set $\mathcal{V}''$, which comprises unseen data. Experiments 'w/ labels' were trained with ground-truth transcripts and 'w/o labels' were trained with pseudo-labels.