An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR

Sewade Ogun, Vincent Colotte, Emmanuel Vincent

TL;DR

The paper addresses the distributional gap between real and synthetic speech in ASR data augmentation. It uses flow-based TTS/VC models to generate diverse synthetic utterances and systematically evaluates how augmenting individual speech attributes (phonetic content, phoneme duration, pitch, speaker diversity, and environmental conditions) affects ASR performance, using Conformer-Transducer and wav2vec2 models on Common Voice and LibriSpeech. Augmenting phonetic content, speaker diversity, duration diversity, and environmental noise reduces WER, while pitch diversity and VC-based speaker changes bring little benefit. Combining the effective attributes yields the largest gains: relative to training on real data only, the WER of a Conformer-Transducer model drops by 11% on Common Voice and by up to 35% on LibriSpeech. The results underscore the importance of controlling diversity in synthetic data and highlight the robustness of self-supervised ASR models to synthetic augmentation, while also noting saturation effects at high synthetic data volumes.

Abstract

Augmenting the training data of automatic speech recognition (ASR) systems with synthetic data generated by text-to-speech (TTS) or voice conversion (VC) has gained popularity in recent years. Several works have demonstrated improvements in ASR performance using this augmentation approach. However, because of the lower diversity of synthetic speech, naively combining synthetic and real data often does not yield the best results. In this work, we leverage recently proposed flow-based TTS/VC models allowing greater speech diversity, and assess the respective impact of augmenting various speech attributes on the word error rate (WER) achieved by several ASR models. Pitch augmentation and VC-based speaker augmentation are found to be ineffective in our setup. Jointly augmenting all other attributes reduces the WER of a Conformer-Transducer model by 11% relative on Common Voice and by up to 35% relative on LibriSpeech compared to training on real data only.
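
For readers unfamiliar with the "relative" WER figures quoted above, the following is a minimal sketch of how a relative reduction is usually computed. The function name and the WER values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, augmented_wer: float) -> float:
    """Return the relative WER reduction in percent.

    baseline_wer:  WER of the model trained on real data only.
    augmented_wer: WER of the model trained on real + synthetic data.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - augmented_wer) / baseline_wer


# Hypothetical WERs (in percent), NOT taken from the paper:
# a drop from 10.0 to 6.5 is a 3.5-point absolute reduction,
# which corresponds to a 35% relative reduction.
example = relative_wer_reduction(10.0, 6.5)
print(f"{example:.1f}% relative WER reduction")
```

A relative reduction is measured against the baseline, so the same absolute drop looks larger when the baseline WER is already low.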

Paper Structure

This paper contains 33 sections, 5 equations, 5 figures, 12 tables, 2 algorithms.

Figures (5)

  • Figure 1: GlowTTS-STDP architecture during inference [ogun23_interspeech].
  • Figure 2: Flow-based VC showing the Mel-spectrogram inversion and voice conversion (VC) processes.
  • Figure 3: KL divergence between natural/uniform distribution of texts and combination of real speech texts and newly selected 50 h / 100 h equivalent of TTS data.
  • Figure 4: UMAP plot of speaker embeddings of real data speakers and speakers selected using the speaker selection algorithm.
  • Figure 5: UMAP plot of speaker embeddings of real data speakers and $K$, $2K$, $4K$ or $8K$ randomly selected speakers not in the real dataset with $K=2{,}457$.