Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation
Jacobo Romero-Díaz, Gerard I. Gállego, Oriol Pareras, Federico Costa, Javier Hernando, Cristina España-Bonet
TL;DR
The paper investigates whether Chain-of-Thought (CoT) prompting in Speech-to-Text Translation (S2TT) delivers its expected advantages over traditional cascaded systems by giving the model access to both speech and transcription. Using a Speech LLM based on SalamandraTA 7B with discretized speech units (DSUs) derived from mHuBERT, it compares CoT and Cascade strategies across six European languages through attribution analyses, robustness tests with simulated transcript errors, and prosody-awareness testing with ContraProST. Results show that CoT largely behaves like a cascade: it relies on the transcript, makes limited use of acoustic cues, and is not inherently more robust to transcription errors. Simple training interventions that introduce Direct S2TT data or noisy transcripts can improve CoT (and, similarly, Cascade) performance. The findings argue that realizing the advantages of CoT requires architectures that explicitly integrate acoustic information throughout translation, rather than relying on CoT prompts alone.
Abstract
Speech-to-Text Translation (S2TT) systems built from Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) modules face two major limitations: error propagation and the inability to exploit prosodic or other acoustic cues. Chain-of-Thought (CoT) prompting has recently been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. Analyzing CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness tests, we find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding Direct S2TT data or injecting noisy transcripts, enhance robustness and increase speech attribution. These findings challenge the assumed advantages of CoT and highlight the need for architectures that explicitly integrate acoustic information into translation.
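To make the idea of corrupted transcripts concrete, the following is a minimal illustrative sketch of word-level noise injection. It is not the procedure used in the paper: the function name `corrupt_transcript`, the three edit operations, the per-word `error_rate`, and the replacement `vocab` are all assumptions chosen for illustration.

```python
import random


def corrupt_transcript(words, error_rate, vocab, rng):
    """Return a noisy copy of `words` (hypothetical helper, not from the paper).

    Each word is corrupted with probability `error_rate` by one of three
    edits chosen uniformly at random: substitution, deletion, or insertion.
    """
    noisy = []
    for word in words:
        if rng.random() >= error_rate:
            noisy.append(word)  # keep the word unchanged
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            noisy.append(rng.choice(vocab))  # replace with a random vocab word
        elif op == "insert":
            noisy.append(word)
            noisy.append(rng.choice(vocab))  # keep the word, add a spurious one
        # "delete": append nothing, so the word is dropped
    return noisy


# Example usage with an assumed toy vocabulary and a fixed seed.
transcript = "the cat sat on the mat".split()
toy_vocab = ["dog", "ran", "under", "table"]
noisy = corrupt_transcript(transcript, 0.3, toy_vocab, random.Random(0))
print(" ".join(noisy))
```

Under this sketch, sweeping `error_rate` would give a rough way to probe how translation quality degrades as the transcript diverges from the speech, and the same function could in principle generate noisy transcripts for training.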
