KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization

Zhaolin Li, Yining Liu, Danni Liu, Tuan Nam Nguyen, Enes Yavuz Ugan, Tu Anh Dinh, Carlos Mullov, Alexander Waibel, Jan Niehues

TL;DR

This work tackles low-resource speech translation from Bemba (bem), North Levantine Arabic (apc), and Tunisian Arabic (aeb) into English, using unconstrained data to train both cascaded ASR+MT and end-to-end (E2E) ST systems. It investigates synthetic data augmentation via MT-augmented ST and TTS-augmented ST, along with intra-distillation regularization, and uses Minimum Bayes Risk (MBR) decoding to combine system outputs. The results show that high-quality synthetic data can boost ASR, MT, and ST performance, and that regularization provides consistent gains across tasks and pre-trained models; however, language-specific effects arise, especially for dialectal Arabic. The study demonstrates practical strategies for improving low-resource ST and reports a gain of approximately 1.5 BLEU points from MBR decoding when combining cascaded and E2E systems.

Abstract

This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded systems, consisting of Automatic Speech Recognition (ASR) and Machine Translation (MT) models, and end-to-end (E2E) Speech Translation (ST) systems for three language pairs: Bemba, North Levantine Arabic, and Tunisian Arabic into English. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently. This study further explores system enhancement with synthetic data and model regularization. Specifically, we investigate MT-augmented ST by generating translations from ASR data using MT models. For North Levantine, which lacks parallel ST training data, a system trained solely on synthetic data slightly surpasses the cascaded system trained on real data. We also explore augmentation using text-to-speech models by generating synthetic speech from MT data, demonstrating the benefits of synthetic data in improving both ASR and ST performance for Bemba. Additionally, we apply intra-distillation to enhance model performance. Our experiments show that this approach consistently improves results across ASR, MT, and ST tasks, as well as across different pre-trained models. Finally, we apply Minimum Bayes Risk decoding to combine the cascaded and end-to-end systems, achieving an improvement of approximately 1.5 BLEU points.
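
The system combination step can be illustrated with a minimal sketch of MBR selection over a pooled candidate list. This is not the authors' implementation: the utility function (`token_f1`, a unigram F1 overlap standing in for a real translation metric), the function name `mbr_select`, and the example hypotheses are all assumptions made for illustration. The idea is only that each candidate is scored by its average similarity to the other candidates, and the one with the highest expected utility is chosen.

```python
from collections import Counter


def token_f1(hyp: str, ref: str) -> float:
    """Unigram F1 overlap; a simple stand-in utility (assumption, not the paper's metric)."""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=token_f1):
    """Return the candidate with the highest average utility against all other candidates."""
    if not candidates:
        raise ValueError("candidates must not be empty")

    def expected_utility(i: int) -> float:
        # Treat every other candidate as a pseudo-reference.
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best_index = max(range(len(candidates)), key=expected_utility)
    return candidates[best_index]


# Hypothetical outputs from a cascaded and an end-to-end system for one utterance.
cascaded_hyps = ["the farmer is selling maize", "a farmer sells maize"]
e2e_hyps = ["the farmer is selling maize today", "farmer selling corn"]

# Pooling the two candidate lists is what combines the systems.
best = mbr_select(cascaded_hyps + e2e_hyps)
print(best)
```

In practice the utility would be a proper translation metric and the candidate lists would come from n-best or sampled outputs of each system; the selection logic stays the same.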

Paper Structure (28 sections, 1 equation, 9 tables)