End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs

Nam Luu, Ondřej Bojar

TL;DR

This work proposes an end-to-end architecture that fuses pre-trained speech encoders with Large Language Models to perform ASR and ST jointly, leveraging 4-bit QLoRA fine-tuning. By compressing speech representations via a length adapter and mapping to LLM embeddings, the model generates both transcripts and translations from a single prompt. On English→German MuST-C data, the best end-to-end configuration often beats SeamlessM4T and can match cascaded Whisper+NLLB on several metrics, with up to $8\%$ improvement in $\text{COMET}^{\text{DA}}_{22}$. Nevertheless, cascaded systems remain the strongest baseline overall, highlighting the need for further data, efficiency improvements, and architectural refinements for end-to-end speech translation.

Abstract

Speech Translation (ST) is a machine translation task that converts speech signals in one language into the corresponding text in another language. There are two approaches to it: the traditional cascade and the more recent end-to-end approach. This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing Automatic Speech Recognition (ASR) and ST simultaneously. Experiments on the English-to-German language pair show that our best model not only achieves better translation results than SeamlessM4T, a large foundational end-to-end multi-modal translation model, but can also match the performance of a cascaded system with Whisper and NLLB, with a score gain of up to 8% in the $\text{COMET}^{\text{DA}}_{22}$ metric.

Paper Structure

This paper contains 17 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The overall architecture includes a frozen speech encoder component, an adapter, and a fine-tuned LLM. The adapter can be frozen or trainable depending on the adapter type. Red arrows denote the usage of tokens during training, blue arrows indicate tokens generated during inference, and black arrows represent the prompt fed to the LLM.
  • Figure 2: Details of different adapters
  • Figure 3: Training loss of models
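
The adapter stage in Figure 1 (shorten the encoder output, then map it into the LLM embedding space) can be illustrated with a minimal sketch. The pooling scheme here (averaging every `stride` consecutive frames), the single linear projection, and the names `length_adapter`, `project`, `frames`, and `weight` are all assumptions made for illustration; the adapter variants the paper actually uses are the ones in Figure 2 and may differ.

```python
from typing import List

Matrix = List[List[float]]


def length_adapter(frames: Matrix, stride: int) -> Matrix:
    """Shorten a frame sequence by averaging every `stride` consecutive frames.

    Assumed pooling scheme for illustration only; the final group may hold
    fewer than `stride` frames.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    pooled: Matrix = []
    for start in range(0, len(frames), stride):
        group = frames[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def project(frames: Matrix, weight: Matrix) -> Matrix:
    """Linearly map each frame from encoder width to LLM embedding width.

    `weight` has shape [encoder_dim][llm_dim] (hypothetical toy values below).
    """
    llm_dim = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(llm_dim)]
        for f in frames
    ]


# Toy input: 5 encoder frames of width 2, compressed with stride 2 to 3 frames,
# then projected to a width-3 "LLM embedding" space.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
speech_embeddings = project(length_adapter(frames, stride=2), weight)
```

In a real system the projected vectors would be placed alongside the prompt's token embeddings before being fed to the LLM. The toy dimensions above only show how the sequence length shrinks while the feature width changes to match the LLM.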