GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task
Chutong Meng, Antonios Anastasopoulos
TL;DR
This work addresses low-resource speech translation by fine-tuning SeamlessM4T-v2 for ASR, MT, and end-to-end (E2E) ST on 10 language pairs from the IWSLT 2025 shared task. It systematically compares direct E2E ST fine-tuning, cascaded ASR+MT pipelines, multi-task training with knowledge distillation (KD), and initialization of ST components from fine-tuned ASR and/or MT models, with a particular focus on languages not seen during pre-training. The key findings are that direct E2E fine-tuning delivers strong ST performance, that initializing from an ASR encoder fine-tuned on in-domain data improves ST for languages unseen in pre-training, and that multi-task training and cascaded systems offer limited or mixed gains. The results provide practical guidance for building low-resource ST systems and point to future work on pre-training strategies and integration with speech-language models.
Abstract
This paper describes the GMU systems for the IWSLT 2025 low-resource speech translation shared task. We trained systems for all language pairs, except for Levantine Arabic. We fine-tuned SeamlessM4T-v2 for automatic speech recognition (ASR), machine translation (MT), and end-to-end speech translation (E2E ST). The ASR and MT models are also used to form cascaded ST systems. Additionally, we explored various training paradigms for E2E ST fine-tuning, including direct E2E fine-tuning, multi-task training, and parameter initialization using components from fine-tuned ASR and/or MT models. Our results show that (1) direct E2E fine-tuning yields strong results; (2) initializing with a fine-tuned ASR encoder improves ST performance on languages SeamlessM4T-v2 has not been trained on; (3) multi-task training can be slightly helpful.
