Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning
Yexing Du, Youcheng Pan, Ziyang Ma, Bo Yang, Yifan Yang, Keqi Deng, Xie Chen, Yang Xiang, Ming Liu, Bing Qin
TL;DR
This paper tackles the challenge of building robust many-to-many speech-to-text translation (S2TT) systems under limited data. It reframes S2TT as a speech recognition and translation (SRT) problem and applies a three-stage curriculum (ASR → SMT → SRT, where SMT denotes speech-aided machine translation) that leverages the machine translation capabilities of large language models. The authors introduce LLM-SRT, an architecture combining a frozen Whisper speech encoder, a trainable speech adapter (Q-Former + MLP), and an LLM, and evaluate it at three model sizes (3B, 7B, 32B) on FLEURS and CoVoST-2. They report state-of-the-art performance for low-resource directions and solid results in high-resource settings, with gains in speed and data efficiency attributed to the optimized adapter design and staged training. The work offers a scalable path toward broad many-to-many S2TT coverage in data-scarce regimes while maintaining strong performance when data is abundant.
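The staged curriculum can be illustrated with a minimal sketch of how one training sample might be turned into a text prompt and a target at each stage. This assumes that ASR maps speech to a transcript, SMT maps speech plus its transcript to a translation, and SRT maps speech to a transcript followed by a translation. The `Sample` class, the `build_example` function, and the `<|lang|>` tag format are hypothetical names chosen for illustration, not the authors' actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One parallel speech sample (hypothetical container)."""
    audio_path: str    # path to the source-language speech
    transcript: str    # source-language text
    translation: str   # target-language text
    src_lang: str      # e.g. "en"
    tgt_lang: str      # e.g. "de"


def build_example(sample: Sample, stage: str) -> dict:
    """Build the text prompt and target for one curriculum stage.

    The audio is an input at every stage; only the text side changes.
    The tag format below is an assumption made for this sketch.
    """
    src = f"<|{sample.src_lang}|>"
    tgt = f"<|{sample.tgt_lang}|>"
    if stage == "asr":
        # Stage 1: speech -> transcript
        prompt, target = src, sample.transcript
    elif stage == "smt":
        # Stage 2: speech + transcript -> translation
        prompt, target = f"{sample.transcript}{src}{tgt}", sample.translation
    elif stage == "srt":
        # Stage 3: speech -> transcript, then translation
        prompt = f"{src}{tgt}"
        target = f"{sample.transcript}{src}{tgt}{sample.translation}"
    else:
        raise ValueError(f"unknown stage: {stage!r}")
    return {"audio": sample.audio_path, "prompt": prompt, "target": target}


sample = Sample("clip.wav", "Good morning.", "Guten Morgen.", "en", "de")
for stage in ("asr", "smt", "srt"):
    print(stage, build_example(sample, stage))
```

In this sketch the SRT target contains the SMT prompt and target back to back, which is one way to see why the intermediate SMT stage could ease the transition from recognition to joint recognition and translation.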
Abstract
Multimodal Large Language Models (MLLMs) have achieved significant success in Speech-to-Text Translation (S2TT) tasks. While most existing research has focused on English-centric translation directions, the exploration of many-to-many translation is still limited by the scarcity of parallel data. To address this, we propose a three-stage curriculum learning strategy that leverages the machine translation capabilities of large language models and adapts them to S2TT tasks, enabling effective learning in low-resource settings. We trained MLLMs with varying parameter sizes (3B, 7B, and 32B) and evaluated the proposed strategy using the FLEURS and CoVoST-2 datasets. Experimental results show that the proposed strategy achieves state-of-the-art average performance in $15\times14$ language pairs, requiring fewer than 10 hours of speech data per language to achieve competitive results. The source code and models are released at https://github.com/yxduir/LLM-SRT.
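The $15\times14$ figure counts directed translation pairs: each of 15 languages serves as a source, paired with each of the 14 other languages as a target. A short sketch of that arithmetic follows; the language codes are placeholders and `directed_pairs` is a hypothetical helper, neither taken from the paper.

```python
from itertools import permutations


def directed_pairs(languages):
    """Return every ordered (source, target) pair with source != target."""
    return list(permutations(languages, 2))


# Placeholder codes standing in for the 15 evaluated languages.
languages = [f"lang{i:02d}" for i in range(15)]
pairs = directed_pairs(languages)
print(len(pairs))  # 15 * 14 = 210
```

Because direction matters in translation, (A, B) and (B, A) are counted separately, which is why the total is 210 rather than the 105 unordered pairs.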
