LLaST: Improved End-to-end Speech Translation System Leveraged by Large Language Models

Xi Chen; Songyang Zhang; Qibing Bai; Kai Chen; Satoshi Nakamura

TL;DR

This work investigates LLaST, an LLM-based end-to-end speech translation framework that fuses a speech encoder, a lightweight adaptor, and a decoder-only LLM. It introduces dual-LoRA fine-tuning, ASR-augmentation, and multilingual data augmentation to achieve scalable, high-performance ST, demonstrated by a state-of-the-art 45.1 BLEU on CoVoST-2 Fr→En and strong results across multiple language pairs. Key findings show that Whisper-based encoders and larger LLMs yield substantial gains, with encoder scaling often delivering greater parameter efficiency than decoder scaling. The approach and open release of data, code, and models aim to establish a robust baseline and guide future research in LLM-driven speech translation.

Abstract

We introduce LLaST, a framework for building high-performance Large Language model based Speech-to-text Translation systems. We address the limitations of end-to-end speech translation (E2E ST) models by exploring model architecture design and optimization techniques tailored for LLMs. Our approach includes LLM-based speech translation architecture design, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. It demonstrates superior performance on the CoVoST-2 benchmark and showcases exceptional scaling capabilities powered by LLMs. We believe this effective method will serve as a strong baseline for speech translation and provide insights for future improvements of the LLM-based speech translation framework. We release the data, code, and models at https://github.com/openaudiolab/LLaST.

Paper Structure (36 sections, 7 equations, 4 figures, 6 tables)

This paper contains 36 sections, 7 equations, 4 figures, 6 tables.

Introduction
Related Work
Cascaded Speech Translation
End-to-End Speech Translation
LLM-based Speech Translation
Method
Problem Setting
Model Architecture
Speech Encoder
Adaptor
Large Language Model
Training and Inference
Optimization with Dual-LoRA Fine-tuning
Training with ASR-augmentation
Inference Methodology
...and 21 more sections

Figures (4)

Figure 1: Model architecture of LLaST. We introduce dual-LoRA in the optimization and keep the weights of the speech encoder and LLM frozen. We use a 3-layer MLP as the adaptor and fine-tune its parameters together with dual-LoRA.
Figure 2: An example of training data.
Figure 3: Influence of different language models. We use Whisper-large-v2 as the speech encoder and report SacreBLEU scores on the CoVoST-2 test set for all experiments.
Figure 4: Influence of different LLMs and ASR-augmentation. We report SacreBLEU scores on the CoVoST-2 test set for all experiments.
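
The Figure 1 caption describes training low-rank adapters while the speech encoder and LLM weights stay frozen. A minimal sketch of that idea for a single linear layer follows. It shows generic LoRA, not the authors' implementation. The class name `LoRALinear`, the `rank` and `alpha` values, and the toy dimensions are all hypothetical. Reading "dual" as one set of adapters on the speech encoder and another on the LLM is an assumption that the text above does not confirm.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration of generic LoRA; not taken from the LLaST code.
    """

    def __init__(self, W, rank=2, alpha=4.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(W), len(W[0])
        self.W = W  # frozen: never updated during fine-tuning
        # A starts small and random, B starts at zero,
        # so the low-rank update contributes nothing before training.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.W, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        # Only A and B would receive gradients.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Toy usage: an 8x8 identity stands in for a frozen pretrained weight.
W = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layer = LoRALinear(W, rank=2)
x = [float(i) for i in range(8)]
y = layer(x)
```

Because `B` is initialized to zero, the adapted layer reproduces the frozen layer exactly at the start of training. Here the adapter holds 32 trainable values against 64 frozen ones, and the gap widens quickly as the layer dimensions grow relative to the rank.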
