
Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning

Yexing Du, Youcheng Pan, Ziyang Ma, Bo Yang, Yifan Yang, Keqi Deng, Xie Chen, Yang Xiang, Ming Liu, Bing Qin

TL;DR

This paper tackles the challenge of building robust many-to-many S2TT systems under limited data by reframing S2TT as an SRT problem and applying a three-stage curriculum (ASR→SMT→SRT) that leverages the MT capabilities of large language models. The authors introduce LLM-SRT, an architecture combining a frozen Whisper speech encoder, a trainable speech adapter (Q-Former + MLP), and an LLM, and evaluate it at three model sizes (3B, 7B, 32B) on FLEURS and CoVoST-2. They report state-of-the-art performance for low-resource directions and solid results in high-resource settings, with gains in speed and data efficiency attributed to the optimized adapter design and staged training. The work offers a scalable path toward broad many-to-many S2TT coverage in data-scarce regimes while maintaining strong performance when data is abundant.

Abstract

Multimodal Large Language Models (MLLMs) have achieved significant success in Speech-to-Text Translation (S2TT) tasks. While most existing research has focused on English-centric translation directions, the exploration of many-to-many translation is still limited by the scarcity of parallel data. To address this, we propose a three-stage curriculum learning strategy that leverages the machine translation capabilities of large language models and adapts them to S2TT tasks, enabling effective learning in low-resource settings. We trained MLLMs with varying parameter sizes (3B, 7B, and 32B) and evaluated the proposed strategy using the FLEURS and CoVoST-2 datasets. Experimental results show that the proposed strategy achieves state-of-the-art average performance across 15×14 language pairs, requiring fewer than 10 hours of speech data per language to achieve competitive results. The source code and models are released at https://github.com/yxduir/LLM-SRT.
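
To make the "15×14" figure concrete: with 15 languages, every language can be a source and each of the other 14 a target, which gives 210 ordered translation directions. The sketch below only illustrates that count. The language identifiers are placeholders rather than the paper's actual language list, and `translation_directions` is a hypothetical helper name.

```python
from itertools import permutations


def translation_directions(languages):
    """Return every ordered (source, target) pair with source != target."""
    return list(permutations(languages, 2))


# Placeholder identifiers; the actual 15 languages are not listed in this summary.
languages = [f"lang{i:02d}" for i in range(15)]
directions = translation_directions(languages)

print(len(directions))  # 15 sources x 14 targets = 210
```

An English-centric setup would cover only the directions that include English, which is a small slice of this grid.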

Paper Structure (44 sections, 4 equations, 3 figures, 18 tables)


Figures (3)

  • Figure 1: Comparison of S2TT Methods. (a) adopts a cascaded system; (b) directly generates translated text; (c) generates both transcription and translation text in an end-to-end process, with <|eng|><|zho|> indicating transcribing English and translating it into Chinese.
  • Figure 2: The Architecture of LLM-SRT. LLM-SRT consists of a speech encoder, a speech adapter, and an LLM. A three-stage curriculum learning strategy trains the ASR, SMT, and SRT tasks in sequence, as shown in Table \ref{pattern}. In stages 1 and 2, the speech adapter is continuously trained to enable efficient fine-tuning. In stage 3, the LLM is additionally unfrozen, while the speech adapter continues to be trained.
  • Figure 3: BLEU Scores for 15×14 Directions: Comparison between MT and S2TT. The results show a strong correlation, suggesting that our S2TT capability is derived from the MT model. Table \ref{tab:error} includes an error analysis showing that S2TT outperforms MT.
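
The staged schedule in the Figure 2 caption can be sketched as a small freeze/unfreeze table. This is an illustration under stated assumptions, not the authors' code: the `Stage` dataclass, `CURRICULUM`, `trainable_modules`, and `srt_tag` are hypothetical names. That the LLM stays frozen in stages 1 and 2 is inferred from the caption's "additionally unfrozen", and the speech encoder is treated as frozen throughout, following the TL;DR. The tag format follows the `<|eng|><|zho|>` example in the Figure 1 caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    index: int
    task: str
    train_adapter: bool
    train_llm: bool


# Assumed reading of the Figure 2 caption; the speech encoder is never trained.
CURRICULUM = [
    Stage(1, "ASR", train_adapter=True, train_llm=False),
    Stage(2, "SMT", train_adapter=True, train_llm=False),
    Stage(3, "SRT", train_adapter=True, train_llm=True),
]


def trainable_modules(stage):
    """List the module names that receive gradient updates in a stage."""
    modules = []
    if stage.train_adapter:
        modules.append("speech_adapter")
    if stage.train_llm:
        modules.append("llm")
    return modules


def srt_tag(source_lang, target_lang):
    """Build a direction tag in the style shown in the Figure 1 caption."""
    return f"<|{source_lang}|><|{target_lang}|>"


for stage in CURRICULUM:
    print(stage.index, stage.task, trainable_modules(stage))

print(srt_tag("eng", "zho"))
```

In a real training loop, the module names returned by `trainable_modules` would decide which parameter groups have gradients enabled at each stage.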