Table of Contents
Fetching ...

Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM

Jiawei Yu, Yuang Li, Xiaosong Qiao, Huan Zhao, Xiaofeng Zhao, Wei Tang, Min Zhang, Hao Yang, Jinsong Su

TL;DR

This paper proposes Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS to clone speech styles that the ASR model finds challenging to recognize.

Abstract

Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems using text-only corpora, thereby reducing the cost of labeling real speech data. Existing research primarily utilizes additional text data and predefined speech styles supported by TTS models. In this paper, we propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS. Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data. Rather than using predefined speech styles, we introduce a hard prompt selection method with zero-shot TTS to clone speech styles that the ASR model finds challenging to recognize. Experiments demonstrate that Hard-Synth significantly enhances the Conformer model, achieving relative word error rate (WER) reductions of 6.5\%/4.4\% on LibriSpeech dev/test-other subsets. Additionally, we show that Hard-Synth is data-efficient and capable of reducing bias in ASR.

Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM

TL;DR

This paper proposes Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS to clone speech styles that the ASR model finds challenging to recognize.

Abstract

Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems using text-only corpora, thereby reducing the cost of labeling real speech data. Existing research primarily utilizes additional text data and predefined speech styles supported by TTS models. In this paper, we propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS. Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data. Rather than using predefined speech styles, we introduce a hard prompt selection method with zero-shot TTS to clone speech styles that the ASR model finds challenging to recognize. Experiments demonstrate that Hard-Synth significantly enhances the Conformer model, achieving relative word error rate (WER) reductions of 6.5\%/4.4\% on LibriSpeech dev/test-other subsets. Additionally, we show that Hard-Synth is data-efficient and capable of reducing bias in ASR.

Paper Structure

This paper contains 21 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: System diagram of Hard-Synth. (a) Zero-shot TTS is employed with hard audio prompts ($\mathbf{X}'$) and text generated by LLM ($\mathbf{Y}_2$) to produce synthetic speech signals ($\mathbf{\hat{X}}$). (b) Hard audio prompts are defined as speech samples that a weak ASR model finds difficult to recognize. (c) Before training, the synthetic speech signals are filtered based on CER. (d) Real and synthetic audio samples are then combined for the final training phase.
  • Figure 2: Statistics of the synthetic dataset.
  • Figure 3: Visualizations of spectrograms. VoiceCraft excels in replicating the noise characteristics of the audio prompt compared to F5-TTS.