LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models
Xi Chen, Songyang Zhang, Qibing Bai, Kai Chen, Satoshi Nakamura
TL;DR
This work presents LLaST, an LLM-based end-to-end speech translation framework that combines a speech encoder, a lightweight adaptor, and a decoder-only LLM. It introduces dual-LoRA fine-tuning, ASR-augmented training, and multilingual data augmentation to achieve scalable, high-performance ST, demonstrated by a state-of-the-art 45.1 BLEU on CoVoST-2 Fr→En and strong results across multiple language pairs. Key findings show that Whisper-based encoders and larger LLMs yield substantial gains, with encoder scaling often delivering greater parameter efficiency than decoder scaling. The approach and the open release of data, code, and models aim to establish a robust baseline and guide future research in LLM-driven speech translation.
Abstract
We introduce LLaST, a framework for building high-performance Large Language model-based Speech-to-text Translation systems. We address the limitations of end-to-end speech translation (E2E ST) models by exploring model architecture design and optimization techniques tailored for LLMs. Our approach includes LLM-based speech translation architecture design, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. It demonstrates superior performance on the CoVoST-2 benchmark and showcases exceptional scaling capabilities powered by LLMs. We believe this effective method will serve as a strong baseline for speech translation and provide insights for future improvements of the LLM-based speech translation framework. We release the data, code, and models at https://github.com/openaudiolab/LLaST.
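As a rough illustration of the encoder, adaptor, and LLM pipeline with low-rank adaptation, the following is a toy NumPy sketch and not the released implementation. It uses the standard LoRA formulation, in which a frozen weight W receives a trainable update (alpha / r) * B @ A. Reading "dual-LoRA" as one set of adapters on the speech encoder and another on the LLM is an assumption made here, since the abstract does not spell it out. All class names, function names, and dimensions (`LoRALinear`, `toy_pipeline`, 16/32/24/8) are hypothetical, and each single linear layer merely stands in for a full encoder or LLM.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, d_in, d_out, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen "pretrained" weight (toy random values, not a real checkpoint).
        self.W = rng.standard_normal((d_out, d_in)) * 0.02
        # Trainable low-rank factors: A gets a small random init, B starts at zero,
        # so the adapted layer initially matches the frozen layer exactly.
        self.A = rng.standard_normal((rank, d_in)) * 0.01
        self.B = np.zeros((d_out, rank))
        self.scale = alpha / rank

    def base(self, x):
        # Output of the frozen layer alone.
        return x @ self.W.T

    def __call__(self, x):
        # x: (frames, d_in) -> (frames, d_out)
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size


def toy_pipeline(features, encoder, adaptor_W, llm):
    """Speech features -> encoder (LoRA) -> adaptor -> LLM (LoRA)."""
    h = encoder(features)              # stand-in for the speech encoder
    h = np.maximum(h @ adaptor_W, 0)   # stand-in adaptor: linear map + ReLU
    return llm(h)                      # stand-in for the decoder-only LLM


# Hypothetical toy sizes: 10 frames of 16-dim features.
rng = np.random.default_rng(1)
features = rng.standard_normal((10, 16))
encoder = LoRALinear(16, 32, seed=2)   # assumed: first LoRA set, on the encoder
adaptor_W = rng.standard_normal((32, 24)) * 0.1
llm = LoRALinear(24, 8, seed=3)        # assumed: second LoRA set, on the LLM

out = toy_pipeline(features, encoder, adaptor_W, llm)
lora_params = encoder.trainable_params() + llm.trainable_params()
```

In this sketch only the `A` and `B` matrices would be updated during fine-tuning, which is why the trainable parameter count stays small relative to the frozen weights.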
