Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition
Wesley Bian, Xiaofeng Lin, Guang Cheng
TL;DR
The paper tackles data scarcity and linguistic bias in automatic speech recognition (ASR) by introducing LatentVoiceMix, a latent-space mixup approach applied within the style encoder of a diffusion-based voice-conversion model. By mixing speaker timbres in the latent space while preserving linguistic content, the method expands acoustic diversity without collecting new data. Empirical results on Wolof and English, across multiple ASR architectures, show consistent word error rate (WER) improvements over traditional augmentation methods, with notable gains for low-resource languages and fairer multilingual performance. The analysis includes ablations and timbre-distribution studies, underscoring the method's effectiveness and its potential to broaden access to advanced speech technologies.
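To make the idea of mixing speaker timbres in a latent space concrete, here is a minimal sketch of standard mixup applied to two style embeddings. It is an illustration only, not the authors' implementation: the function name `mix_style_embeddings`, the 256-dimensional embeddings, the Beta(alpha, alpha) sampling of the mixing weight, and the value of `alpha` are all assumptions, and the random vectors stand in for the outputs of a real style encoder. The linguistic content features are not touched, which is what would let a voice-conversion decoder keep the spoken words while changing the voice.

```python
import numpy as np


def mix_style_embeddings(z_a, z_b, alpha=0.5, rng=None):
    """Convex combination of two speaker-style embeddings (generic mixup).

    Hypothetical helper: the mixing weight is drawn from Beta(alpha, alpha),
    as in standard mixup. The summary above does not state the paper's
    actual sampling scheme.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)             # mixing weight in [0, 1]
    z_mix = lam * z_a + (1.0 - lam) * z_b    # interpolated timbre embedding
    return z_mix, lam


# Stand-ins for style-encoder outputs of two different speakers (assumed 256-d).
rng = np.random.default_rng(0)
z_a = rng.normal(size=256)
z_b = rng.normal(size=256)

z_mix, lam = mix_style_embeddings(z_a, z_b, alpha=0.5, rng=rng)
print(f"mixing weight: {lam:.3f}, mixed embedding shape: {z_mix.shape}")
```

In a full pipeline, the mixed embedding would presumably be passed to the voice-conversion decoder together with the unchanged content representation of an existing utterance, producing a new training sample with the same transcript and a synthetic voice.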
Abstract
Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages, primarily due to the abundance of available training data. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. In this work, we introduce a novel data augmentation technique for speech corpora designed to mitigate this gap. Through comprehensive experiments, we demonstrate that our method significantly improves the performance of automatic speech recognition systems on low-resource languages. Furthermore, we show that our approach outperforms existing augmentation strategies, offering a practical solution for enhancing speech technology in underrepresented linguistic communities.
