Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle

TL;DR

This study systematically evaluates SpeechLLMs for speech-to-text translation (ST) against direct and cascaded baselines using Hearing to Translate, a phenomenon-aware test suite spanning 16 benchmarks and 13 language pairs. It finds that cascaded systems generally provide the strongest and most consistent translation quality, while SpeechLLMs show potential in noisy, code-switched, and disfluent scenarios but fail to surpass cascades in most settings. Standalone speech foundation models (SFMs) underperform both cascades and SpeechLLMs, highlighting the vital role of LLMs in achieving high-quality ST. The work also analyzes gender bias and accent robustness, showing encoder-driven effects and stressing the need for more diverse training data and robust evaluation of multilingual speech. Human evaluation corroborates the trends from automatic metrics, reinforcing the study’s methodology and conclusions about the current state and actionable directions for SpeechLLMs.

Abstract

As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.

Paper Structure

This paper contains 60 sections, 2 equations, 5 figures, 26 tables.

Figures (5)

  • Figure 1: Plot showing the relationship between Gender Coreference Gap ($\Delta \text{F1}_\text{\female\male}$) and Stereotypical Gap ($\Delta \text{S}_\text{\female\male}$) across all evaluated systems.
  • Figure 2: Standard deviation of xCOMET$^\text{QE}_\text{S}$ scores for ManDi (zh-en) and CommonAccent (all other directions) across source-language accent. Numerical values for all cells can be found in Table \ref{tab:stdev}.
  • Figure 3: Screenshot of the Pearmut \citep{zouhar2026pearmut} annotation interface together with annotation guidelines. The annotator first listens to the source audio, then scans the three model outputs where they mark error spans with severities and categories. Lastly, the annotator assigns the final scores and proceeds to the next item.
  • Figure 4: xCOMET$^\text{QE}_\text{S}$ results for language pairs into English, broken down by source-language accent. The zh-en results come from ManDi, while all other pairs represent CommonAccent results.
  • Figure 5: CommonAccent xCOMET$^\text{QE}_\text{S}$ results for language pairs out of English, broken down by source speech accent.