persoDA: Personalized Data Augmentation for Personalized ASR

Pablo Peso Parada; Spyros Fontalis; Md Asif Jalal; Karthikeyan Saravanan; Anastasios Drosou; Mete Ozay; Gil Ho Lee; Jungin Lee; Seokyeong Jung

TL;DR

persoDA tackles the mismatch between training and test conditions in on-device ASR by learning augmentation parameters $\theta$ from a user’s data to mirror their acoustic environment $E$. It decomposes into two personalized augmentation streams: persoNoise for user-specific background noise and persoReverb for room reverberation, each derived from unlabeled user recordings using VAD, speech separation, and T60-based RIR selection. Empirical results on LibriSpeech and VOiCES show that persoDA yields a relative WER reduction of about 13.9% over standard data augmentation and accelerates convergence by 16–20% compared with MCT. The approach requires minimal storage (a few seconds of RIRs) and demonstrates practical benefits for fast, on-device personalization with limited user data.
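The two streams named above can be illustrated with a minimal sketch. It is not the authors' implementation: the energy-threshold VAD stands in for whatever VAD the paper uses, the user's T60 is assumed to have been estimated elsewhere, and the function names `extract_noise_energy_vad` and `select_rirs_by_t60`, the frame length, and the threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def extract_noise_energy_vad(signal, frame_len=400, threshold_db=-35.0):
    """Return the samples of frames judged to be non-speech.

    Assumption: a crude energy VAD. A frame counts as noise when its energy
    is more than |threshold_db| dB below the loudest frame in the recording.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Per-frame energy in dB; the small constant avoids log(0) on silence.
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    is_noise = energy_db < (energy_db.max() + threshold_db)
    return frames[is_noise].reshape(-1)


def select_rirs_by_t60(user_t60, bank_t60, k=3):
    """Return indices of the k RIRs whose T60 is closest to the user's T60.

    Assumption: `bank_t60` holds one known T60 value (in seconds) per RIR in
    a pre-existing RIR bank; estimating `user_t60` is out of scope here.
    """
    bank_t60 = np.asarray(bank_t60, dtype=float)
    order = np.argsort(np.abs(bank_t60 - user_t60), kind="stable")
    return order[:k].tolist()
```

In use, the extracted noise segments would be mixed into training utterances in place of random noise, and the selected RIRs would be convolved with them in place of randomly drawn ones, so that augmentation mirrors the user's own environment.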

Abstract

Data augmentation (DA) is ubiquitous in training Automatic Speech Recognition (ASR) models. DA increases data variability, robustness, and generalization against different acoustic distortions. Recently, personalization of ASR models on mobile devices has been shown to improve Word Error Rate (WER). This paper evaluates data augmentation in this context and proposes persoDA, a DA method driven by the user's data and used to personalize ASR. persoDA aims to augment training with data specifically tuned to the acoustic characteristics of the end user, as opposed to standard augmentation based on Multi-Condition Training (MCT), which applies random reverberation and noise. Our evaluation with a Conformer-based ASR baseline trained on LibriSpeech and personalized for VOiCES shows that persoDA achieves a 13.9% relative WER reduction over standard data augmentation (random noise and reverberation). Furthermore, persoDA converges 16% to 20% faster than MCT.
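For readers unfamiliar with the metric, the relative WER reduction quoted above is assumed here to follow the usual definition, with standard data augmentation as the reference. The subscripts are illustrative labels and not notation taken from the paper.

```latex
\[
\mathrm{WERR}_{\mathrm{rel}}
  = \frac{\mathrm{WER}_{\mathrm{standard\,DA}} - \mathrm{WER}_{\mathrm{persoDA}}}
         {\mathrm{WER}_{\mathrm{standard\,DA}}} \times 100\%
\]
```

Under this definition, a 13.9% relative reduction means that the WER with persoDA is about 0.861 times the WER obtained with standard data augmentation.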

Paper Structure (14 sections, 1 equation, 3 figures, 2 tables)

Introduction
Method
persoNoise
VAD based Noise Extraction
Speech Separation based Noise Extraction
persoReverb
Evaluation
Experimental Setup
Datasets
Evaluation Metrics
Experimental Analyses
Comparison of persoDA
Evaluation on disjoint training and validation sets with and without pseudo-labels
Conclusion

Figures (3)

Figure 1: The personalized data augmentation (persoDA) framework. persoDA guides the training DA process to select the most suitable augmentation given the user's data.
Figure 2: WER achieved on the set $\mathcal{V}'$ used to estimate information to guide persoDA. Experiments 'w/ labels' were trained with ground-truth transcripts and 'w/o labels' were trained with pseudo-labels.
Figure 3: WER achieved on the set $\mathcal{V}''$, which comprises unseen data. Experiments 'w/ labels' were trained with ground-truth transcripts and 'w/o labels' were trained with pseudo-labels.
