Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation

Jacobo Romero-Díaz, Gerard I. Gállego, Oriol Pareras, Federico Costa, Javier Hernando, Cristina España-Bonet

TL;DR

The paper investigates whether Chain-of-Thought (CoT) prompting in Speech-to-Text Translation (S2TT) provides advantages over traditional cascaded systems by giving the model access to both speech and transcription. Using a SalamandraTA 7B-based Speech LLM with discrete speech units (DSUs) derived from mHuBERT, it compares CoT and Cascade strategies across six European languages through attribution analyses, robustness to simulated transcript errors, and prosody-awareness testing with ContraProST. Results show that CoT largely behaves like a cascade: it relies on transcripts, makes limited use of acoustic cues, and is not inherently more robust to transcription errors. Simple training interventions that introduce Direct S2TT data or noisy transcripts can improve CoT (and similarly Cascade) performance. The findings argue that realizing CoT's advantages requires architectures that explicitly integrate acoustic information throughout translation, rather than relying on CoT prompts alone.

Abstract

Speech-to-Text Translation (S2TT) systems built from Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) modules face two major limitations: error propagation and the inability to exploit prosodic or other acoustic cues. Chain-of-Thought (CoT) prompting has recently been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. Analyzing CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness, we find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding Direct S2TT data or noisy transcript injection, enhance robustness and increase speech attribution. These findings challenge the assumed advantages of CoT and highlight the need for architectures that explicitly integrate acoustic information into translation.

Paper Structure

This paper contains 13 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Example of corrupted transcription generated with Gemini-2.0-Flash (corruption ratio: 15%).
  • Figure 2: Layer-wise attribution scores obtained with Value Zeroing, aggregated over Speech, Transcription, and Translation tokens. Each subfigure shows one model variant (Base, Dual, Noisy). Contributions from special tokens are omitted for clarity. Shaded areas indicate mean $\pm$ std across language pairs.
  • Figure 3: Robustness to error propagation under controlled transcript corruption. Performance drop ($\Delta$xCOMET) is measured when noisy transcripts are injected into the CoT prompt, relative to ground-truth transcripts (0% noise). Each panel corresponds to one model variant (Base, Dual, Noisy), with curves comparing CoT and Cascade inference. Results are averaged across languages, and shaded areas indicate mean $\pm$ std.
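
The quantity plotted in Figure 3 can be illustrated with a minimal sketch. It assumes the drop is defined as the score at a given noise level minus the score at 0% noise, so negative values mean degradation; the sign convention, the function name `delta_scores`, the language-pair keys, and all numbers below are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, pstdev


def delta_scores(scores_by_lang):
    """Summarize the score change relative to the 0%-noise baseline.

    scores_by_lang maps a language pair to {noise_percentage: score}.
    Returns {noise_percentage: (mean_delta, std_delta)} across languages.
    Assumed convention: delta = score(noise) - score(0% noise).
    """
    # Take the noise levels from the first language; all are assumed to share them.
    noise_levels = sorted(next(iter(scores_by_lang.values())))
    summary = {}
    for level in noise_levels:
        deltas = [scores[level] - scores[0] for scores in scores_by_lang.values()]
        # Population std across languages (the shaded band in the figure).
        summary[level] = (mean(deltas), pstdev(deltas))
    return summary


# Placeholder scores for illustration only.
example = {
    "xx-yy": {0: 0.90, 15: 0.80},
    "xx-zz": {0: 0.80, 15: 0.60},
}
result = delta_scores(example)
for level, (avg, std) in result.items():
    print(f"noise={level}%  mean delta={avg:+.3f}  std={std:.3f}")
```

Running the same function once per inference mode (CoT, Cascade) and per model variant would give one curve per mode in each panel.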