Augmenting Polish Automatic Speech Recognition System With Synthetic Data

Łukasz Bondaruk, Jakub Kubiak, Mateusz Czyżnikiewicz

TL;DR

This paper describes a Voicebox-based speech synthesis pipeline, shows that adding synthetic speech to training significantly improves results, and presents the final results achieved by the models in the competition.

Abstract

This paper presents a system developed for submission to PolEval 2024, Task 3: Polish Automatic Speech Recognition Challenge. We describe a Voicebox-based speech synthesis pipeline and use it to augment Conformer and Whisper speech recognition models with synthetic data. We show that adding synthetic speech to training significantly improves results. We also present the final results achieved by our models in the competition.

Paper Structure

This paper contains 10 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Automatic speech recognition system augmented with synthetic speech, presented as a hierarchical system. In Stage I, the Synthesizer is trained; its weights are then frozen in Stage II, where the Recognizer is trained. In Stage II, data is sampled and either provided directly to the Recognizer (yellow data flow) or first processed by the Synthesizer and only then provided to the Recognizer (red data flow).
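
The Stage II data flow described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-sample probability `p_synth`, and the placeholder synthesizer are all hypothetical, and the caption does not say how the choice between the two flows is made. The sketch only shows the routing idea: each sampled item reaches the Recognizer either as recorded audio or as audio produced by a frozen Synthesizer.

```python
import random
from typing import Callable, List, Sequence, Tuple

# A sample is (transcript, audio). Audio is a plain list of floats here,
# which is a simplification made only for this illustration.
Sample = Tuple[str, List[float]]


def build_recognizer_batch(
    samples: Sequence[Sample],
    synthesize: Callable[[str], List[float]],
    p_synth: float,
    rng: random.Random,
) -> List[Tuple[str, List[float], str]]:
    """Route each sample through the direct flow or the Synthesizer flow.

    `p_synth` is an assumed per-sample probability of taking the
    Synthesizer flow; the source does not specify the selection rule.
    """
    batch = []
    for text, real_audio in samples:
        if rng.random() < p_synth:
            # Red data flow: the frozen Synthesizer generates the audio.
            batch.append((text, synthesize(text), "synthetic"))
        else:
            # Yellow data flow: recorded audio goes straight to the Recognizer.
            batch.append((text, real_audio, "real"))
    return batch


def dummy_synthesizer(text: str) -> List[float]:
    """Hypothetical stand-in for a trained Synthesizer with frozen weights."""
    return [0.0] * len(text)
```

Setting `p_synth` to 0.0 reproduces training on recorded speech only, while 1.0 sends every sample through the Synthesizer; intermediate values mix the two flows within a batch.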