The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?

Jayadev Billa

TL;DR

The paper asks whether end-to-end speech LLMs offer genuine architectural advantages over traditional ASR→LLM cascades. It introduces the Cascade Equivalence Hypothesis and combines matched-backbone behavioral testing with mechanistic analyses (layer-wise probing, logit lens, LEACE concept erasure) to separate architectural effects from backbone differences. The findings reveal a spectrum of cascade equivalence: Ultravox closely mirrors its matched cascade on text-sufficient tasks, while Qwen2-Audio shows genuine architectural divergence. LEACE confirms that text representations are causally necessary, and noise tests show that cascades are more robust as the signal-to-noise ratio degrades. These results inform benchmarking and deployment, suggesting cascades remain preferable for text-sufficient tasks in clean conditions, while genuine end-to-end advantages require objective-driven training to exploit acoustic signals.

Abstract

Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade ($\kappa{=}0.93$); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text representations are causally necessary in both architectures tested, collapsing accuracy to near-zero. Qwen2-Audio genuinely diverges, revealing cascade equivalence is architecture-dependent, not universal. For most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.

Paper Structure (20 sections, 1 equation, 6 figures, 7 tables)


Figures (6)

  • Figure 1: Cohen's $\kappa$ between each E2E model and its matched-backbone cascade ($^\dagger$) or Cascade-S. Rows are ordered by decreasing mean $\kappa$. 95% bootstrap CIs (1,000 resamples) are ${\pm}0.01$--$0.02$ on text-sufficient tasks and ${\pm}0.03$--$0.07$ on MELD/MUStARD. Ultravox shows consistently high agreement, while Qwen2-Audio and Phi-4-MM show lower and more variable agreement.
  • Figure 2: Conditional error overlap: $P(\hat{y}_{e2e}{=}\hat{y}_{cas} \mid \text{both wrong})$ for each speech LLM vs. its cascade counterpart on multi-class tasks. Values near 1.0 indicate that when both systems err, they produce the same wrong answer, the signature of a shared reasoning pathway. Dashed lines mark chance baselines ($1/(|C|{-}1)$). Matched-backbone pairs ($^\dagger$) consistently achieve higher overlap than mismatched pairs (vs Cas-S), confirming that shared failures arise from the LLM backbone rather than the audio encoder.
  • Figure 3: Accuracy vs. SNR for text-sufficient tasks. Cascade-S (solid line, circles) degrades gracefully across all tasks. Gemini (solid line, triangles) shows the steepest decline on SST-2 and CSQA despite superior clean performance, suggesting higher noise sensitivity in its internal speech processing.
  • Figure 4: Layer-wise probing. Left axis: acoustic probe $R^2$ for energy (solid, circles) and pitch (dashed, squares). Right axis: CTC text decodability (solid, triangles) and bag-of-characters $R^2$ (dashed, diamonds). Qwen2-Audio shows the strongest acoustic compression; Ultravox retains acoustics while building text representations.
  • Figure 5: Logit lens: mean bag-of-tokens precision by layer (with RMSNorm applied before projection). Both models show text emergence from L20 onward. Ultravox (diamonds) surges to $0.34$ at L31; Qwen2-Audio (triangles) peaks at $0.23$ (L28). The degree of text emergence mirrors the behavioral agreement spectrum.
  • ...and 1 more figures
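The agreement statistics named in the captions for Figures 1 and 2 can be illustrated with a minimal sketch. It assumes each system's predictions are available as equal-length label arrays, and it uses a percentile bootstrap; the captions state 1,000 resamples but not the bootstrap variant, so that choice is an assumption. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
import numpy as np


def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    a, b = np.asarray(a), np.asarray(b)
    labels = np.union1d(a, b)
    p_obs = np.mean(a == b)
    # Chance agreement from the two systems' marginal label frequencies.
    p_exp = sum(np.mean(a == c) * np.mean(b == c) for c in labels)
    if np.isclose(p_exp, 1.0):
        return 1.0  # degenerate case: both systems emit one identical label
    return float((p_obs - p_exp) / (1.0 - p_exp))


def bootstrap_kappa_ci(a, b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa (assumed variant, see lead-in)."""
    a, b = np.asarray(a), np.asarray(b)
    rng = np.random.default_rng(seed)
    n = len(a)
    stats = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample items with replacement
        stats.append(cohens_kappa(a[idx], b[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)


def conditional_error_overlap(e2e, cas, gold):
    """P(e2e prediction == cascade prediction | both are wrong)."""
    e2e, cas, gold = (np.asarray(x) for x in (e2e, cas, gold))
    both_wrong = (e2e != gold) & (cas != gold)
    if not both_wrong.any():
        return float("nan")  # undefined when the systems never fail together
    return float(np.mean(e2e[both_wrong] == cas[both_wrong]))


def chance_overlap(n_classes):
    """Chance baseline 1 / (|C| - 1) from the Figure 2 caption."""
    return 1.0 / (n_classes - 1)


if __name__ == "__main__":
    # Toy three-class labels, purely illustrative.
    gold = [0, 1, 2, 0, 1, 2]
    e2e = [0, 1, 1, 0, 2, 2]
    cas = [0, 1, 1, 0, 0, 2]
    print("kappa:", cohens_kappa(e2e, cas))
    print("95% CI:", bootstrap_kappa_ci(e2e, cas))
    print("error overlap:", conditional_error_overlap(e2e, cas, gold))
    print("chance baseline:", chance_overlap(3))
```

The chance baseline is $1/(|C|{-}1)$ rather than $1/|C|$ because, once both systems are known to be wrong, only $|C|{-}1$ incorrect labels remain for each to choose from.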