Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano, Gerard I. Gállego, Jorge Iranzo-Sánchez, Ahrii Kim, Dominik Macháček, Patricia Schmidtova, Maike Züfle
TL;DR
This study systematically evaluates SpeechLLMs for speech-to-text translation (ST) against direct and cascaded baselines using Hearing to Translate, a phenomenon-aware test suite spanning 16 benchmarks and 13 language pairs. It finds that cascaded systems generally provide the strongest and most consistent translation quality, while SpeechLLMs show potential in noisy, code-switched, and disfluent scenarios but fail to surpass cascades in most settings. Standalone speech foundation models (SFMs) underperform both cascades and SpeechLLMs, highlighting the vital role of LLMs in achieving high-quality ST. The work also analyzes gender bias and accent robustness, showing encoder-driven effects and stressing the need for more diverse training data and robust evaluation of multilingual speech. Human evaluation corroborates the trends from automatic metrics, reinforcing the study’s methodology and its conclusions about the current state of SpeechLLMs and actionable directions for improving them.
Abstract
As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascaded systems that couple leading speech foundation models (SFMs) with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.
