Leveraging Synthetic Audio Data for End-to-End Low-Resource Speech Translation
Yasmin Moslem
TL;DR
The paper tackles data scarcity in low-resource Irish-to-English end-to-end speech translation by fine-tuning Whisper models on a mix of authentic and synthetic audio data. It introduces speech back-translation and three synthetic audio corpora generated from Tatoeba, Wikimedia, and EUbookshop texts, complemented by audio-signal processing augmentations such as voice activity detection (VAD) and noise addition. Empirical results show that augmenting authentic data with synthetic audio substantially improves translation quality, with the best performance achieved when all three synthetic datasets are used (Model B++). The findings highlight the viability of synthetic audio data for low-resource speech translation and offer practical guidance on augmentation strategies and training choices.
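The noise addition mentioned above can be illustrated with a minimal sketch. This is an assumption-laden example, not the paper's actual pipeline: it adds white Gaussian noise to a waveform at a chosen signal-to-noise ratio using NumPy, with a hypothetical `add_noise` helper.

```python
import numpy as np

def add_noise(audio: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Add white Gaussian noise to a waveform at a target SNR (in dB).

    Illustrative helper only; the paper's augmentation setup may differ.
    """
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(audio ** 2)
    # Solve SNR_dB = 10 * log10(signal_power / noise_power) for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

# Example: a 1-second 440 Hz tone at 16 kHz, corrupted at 10 dB SNR.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = add_noise(clean, snr_db=10.0)
```

Varying the SNR across copies of the same utterance is one simple way to enrich signal diversity in a synthetic training set.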
Abstract
This paper describes our system submission to the International Conference on Spoken Language Translation (IWSLT 2024) for Irish-to-English speech translation. We built end-to-end systems based on Whisper and employed several data augmentation techniques, such as speech back-translation and noise augmentation. We investigate the effect of using synthetic audio data and discuss several methods for enriching signal diversity.
