Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?
Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli
TL;DR
This paper surveys the emerging field of speech-to-text translation (ST) that combines Speech Foundation Models (SFMs) with Large Language Models (LLMs). It proposes a unified five-block architectural abstraction (SFM, Length Adapter, Modality Adapter, Prompt-Speech Mixer, and LLM) to organize the diverse ST solutions, and it analyzes nine publicly available works, highlighting substantial heterogeneity in data, tasks, fine-tuning, and evaluation. The authors argue that the lack of open training settings and standardized evaluation hinders fair comparison and progress. They recommend open data standards, open benchmarks, more granular comparisons with standard ST approaches, and deeper study of how in-context learning abilities transfer. By systematizing current approaches and identifying key gaps, the paper aims to steer future research on SFM+LLM-based ST toward more reproducible, fair, and insightful progress.
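To make the five-block abstraction concrete, the following is a minimal Python sketch of how the blocks could be chained, assuming they are applied in the order listed above. Every function body here is a hypothetical toy stand-in chosen for illustration (frame statistics instead of a real SFM, average pooling as the length adapter, a fixed linear map as the modality adapter, prompt prepending as the mixer, and a placeholder string instead of LLM generation). None of it reproduces any surveyed system.

```python
from typing import List

Vector = List[float]


def sfm(audio: List[float], frame: int = 4) -> List[Vector]:
    """Toy stand-in for an SFM encoder: one (mean, peak) feature per frame."""
    feats = []
    for i in range(0, len(audio), frame):
        chunk = audio[i:i + frame]
        feats.append([sum(chunk) / len(chunk), max(abs(x) for x in chunk)])
    return feats


def length_adapter(feats: List[Vector], stride: int = 2) -> List[Vector]:
    """Shorten the sequence by averaging each group of `stride` vectors."""
    out = []
    for i in range(0, len(feats), stride):
        group = feats[i:i + stride]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


def modality_adapter(feats: List[Vector], llm_dim: int = 3) -> List[Vector]:
    """Map each vector into the (assumed) LLM embedding size with a fixed
    toy linear projection: output j is (j + 1) times the sum of the inputs."""
    return [[sum(v) * (j + 1) for j in range(llm_dim)] for v in feats]


def prompt_speech_mixer(prompt_emb: List[Vector],
                        speech_emb: List[Vector]) -> List[Vector]:
    """Prepend prompt embeddings to speech embeddings (one possible mixing)."""
    return prompt_emb + speech_emb


def llm(inputs: List[Vector]) -> str:
    """Stand-in for the LLM: reports its input length instead of generating."""
    return f"<translation conditioned on {len(inputs)} embedding vectors>"


def translate(audio: List[float], prompt_emb: List[Vector]) -> str:
    """Chain the five blocks: SFM, length adapter, modality adapter, mixer, LLM."""
    x = sfm(audio)
    x = length_adapter(x)
    x = modality_adapter(x)
    x = prompt_speech_mixer(prompt_emb, x)
    return llm(x)


if __name__ == "__main__":
    audio = [0.1] * 16                 # 16 samples -> 4 frames -> 2 vectors
    prompt = [[0.0, 0.0, 0.0]]         # a single hypothetical prompt embedding
    print(translate(audio, prompt))
```

The point of the sketch is only the interface between blocks: the length adapter changes the sequence length, the modality adapter changes the vector size, and the mixer decides how prompt and speech representations are combined before reaching the LLM.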
Abstract
The field of natural language processing (NLP) has recently witnessed a transformative shift with the emergence of foundation models, particularly Large Language Models (LLMs) that have revolutionized text-based NLP. This paradigm has extended to other modalities, including speech, where researchers are actively exploring the combination of Speech Foundation Models (SFMs) and LLMs into single, unified models capable of addressing multimodal tasks. Among such tasks, this paper focuses on speech-to-text translation (ST). By examining the published papers on the topic, we propose a unified view of the architectural solutions and training strategies presented so far, highlighting similarities and differences among them. Based on this examination, we not only organize the lessons learned but also show how diverse settings and evaluation approaches hinder the identification of the best-performing solution for each architectural building block and training choice. Lastly, we outline recommendations for future work on the topic aimed at better understanding the strengths and weaknesses of the SFM+LLM solutions for ST.
