GMU Systems for the IWSLT 2025 Low-Resource Speech Translation Shared Task

Chutong Meng, Antonios Anastasopoulos

TL;DR

This work addresses low-resource speech translation by fine-tuning the SeamlessM4T-v2 model across ASR, MT, and end-to-end ST for 10 language pairs in IWSLT 2025. It systematically compares direct E2E ST fine-tuning, cascaded ASR+MT pipelines, multi-task KD, and initialization of ST components from fine-tuned ASR/MT models, with a particular focus on languages not seen during pre-training. Key findings show that direct E2E fine-tuning delivers strong ST performance, that initializing from an ASR encoder fine-tuned on in-domain data markedly helps languages unseen in pre-training, and that multi-task or cascaded setups offer limited or mixed gains. The results provide practical guidance for deploying low-resource ST systems and point to future work on pre-training strategies and speech-language model integration.

Abstract

This paper describes the GMU systems for the IWSLT 2025 low-resource speech translation shared task. We trained systems for all language pairs, except for Levantine Arabic. We fine-tuned SeamlessM4T-v2 for automatic speech recognition (ASR), machine translation (MT), and end-to-end speech translation (E2E ST). The ASR and MT models are also used to form cascaded ST systems. Additionally, we explored various training paradigms for E2E ST fine-tuning, including direct E2E fine-tuning, multi-task training, and parameter initialization using components from fine-tuned ASR and/or MT models. Our results show that (1) direct E2E fine-tuning yields strong results; (2) initializing with a fine-tuned ASR encoder improves ST performance on languages SeamlessM4T-v2 has not been trained on; (3) multi-task training can be slightly helpful.
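The parameter-initialization idea in the abstract can be illustrated with a small sketch. This is not the authors' code: the component prefixes (`speech_encoder.`, `text_decoder.`), the function name `init_st_from_components`, and the toy weight dictionaries are all hypothetical, chosen only to show how an E2E ST model might start from the base weights and take over selected components from fine-tuned ASR and/or MT models.

```python
from typing import Dict, List

# Hypothetical parameter-name prefixes; real checkpoints may name components differently.
SPEECH_ENCODER = "speech_encoder."
TEXT_DECODER = "text_decoder."

Weights = Dict[str, List[float]]


def init_st_from_components(
    base: Weights,
    asr: Weights,
    mt: Weights,
    use_asr_encoder: bool = True,
    use_mt_decoder: bool = True,
) -> Weights:
    """Start from the base weights and overwrite selected components."""
    merged = dict(base)  # shallow copy so the base mapping is left untouched
    if use_asr_encoder:
        # Take only the speech-encoder parameters from the fine-tuned ASR model.
        merged.update({k: v for k, v in asr.items() if k.startswith(SPEECH_ENCODER)})
    if use_mt_decoder:
        # Take only the text-decoder parameters from the fine-tuned MT model.
        merged.update({k: v for k, v in mt.items() if k.startswith(TEXT_DECODER)})
    return merged


# Toy weights standing in for real checkpoints (assumed values, for illustration only).
base = {"speech_encoder.w": [0.0], "text_decoder.w": [0.0]}
asr = {"speech_encoder.w": [1.0], "text_decoder.w": [2.0]}
mt = {"text_encoder.w": [3.0], "text_decoder.w": [4.0]}

st_init = init_st_from_components(base, asr, mt)
asr_only = init_st_from_components(base, asr, mt, use_mt_decoder=False)
```

Here `st_init` takes its speech encoder from the ASR model and its text decoder from the MT model, while `asr_only` corresponds to the ASR-encoder-only initialization that the abstract reports as helpful for languages SeamlessM4T-v2 has not been trained on. E2E ST fine-tuning would then proceed from these weights.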

Paper Structure

This paper contains 19 sections, 6 equations, 1 figure, and 11 tables.

Figures (1)

  • Figure 1: Illustration of our SeamlessM4T-v2 fine-tuning strategies. Speech Encoder, Text Encoder, and Text Decoder refer to the corresponding components of SeamlessM4T-v2.