Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

Alexandre R. Ferreira; Cláudio E. C. Campelo

TL;DR

The paper tackles data scarcity in training robust automatic speech recognition (ASR) models for under-resourced accents and languages by proposing a data augmentation framework based on deepfake audio, built from a few-shot voice-cloning system and a DeepSpeech transcriptor. It validates the approach on an Indian English dataset (NPTEL Pure-Set) and reports two experiments: one that trains the transcriptor on augmented data and another that retrains the voice cloner. In both experiments the augmented data did not improve transcription quality: the word error rate (WER) increased, which is attributed to the low quality of the cloned audio. This underscores the importance of audio realism and of speaker-identity information for encoder training. The work highlights practical limitations of current deepfake audio for ASR augmentation and suggests avenues for improvement, including higher-quality cloning models and cleaner, speaker-annotated English datasets.
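Since the results are reported in terms of WER, a minimal sketch of how that metric is commonly computed may help: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; this is not the evaluation code used in the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is a plain whitespace split, with no
    text normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A higher value means a worse transcription, and the value can exceed 1.0 when the hypothesis contains many inserted words.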

Abstract

Training transcriptor models that produce robust results requires a large and diverse labeled dataset. Finding data with the necessary characteristics is challenging, especially for languages less popular than English, and producing it requires significant effort and often considerable cost. Data augmentation is one strategy for mitigating this problem. In this work, we propose a data augmentation framework based on deepfake audio. To validate the framework, experiments were conducted using existing deepfake and transcription models. A voice cloner and an English-language dataset recorded by Indian speakers were selected, ensuring that a single accent is present in the dataset. The augmented data was then used to train speech-to-text models in various scenarios.

Paper Structure (16 sections, 6 figures, 7 tables)

This paper contains 16 sections, 6 figures, 7 tables.

Introduction
Related Work
Theoretical Foundation
Voice Cloning
Transcriptor
Methodology
Dataset
Data Preprocessing
Voice Cloner Training
Audios Generation
Training the Transcriptor
Inferences in the Transcriptor
Results and Discussions
Experiment 1
Experiment 2
...and 1 more section

Figures (6)

Figure 1: Voice Cloner Architecture (Real-Time Voice Cloning)
Figure 2: Illustration of a step in the process of generating new audios
Figure 3: Illustration of the step-by-step performed in Experiment 1
Figure 4: Illustration of the step-by-step performed in Experiment 2
Figure 5: Qualitative analysis of the retrained models
...and 1 more figure
