Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs

Ming-Hao Hsu, Xueyao Zhang, Xiaohai Tian, Jun Zhang, Zhizheng Wu

TL;DR

This paper analyzes how speech and text representations evolve layer by layer in end-to-end speech LLMs. It finds that speech representations exhibit a broad cross-layer alignment band, attributable to the redundancy of speech, where semantic content spans multiple frames. The results suggest that the bottleneck lies in condensing redundant speech into stable late-layer decisions.

Abstract

Recent advancements in Large Speech-Language Models have significantly bridged the gap between acoustic signals and linguistic understanding. However, a persistent performance disparity remains in speech-based input tasks compared to direct text inference. In this paper, we investigate the dynamic roots of this modality gap beyond static geometric alignment, analyzing how speech and text representations evolve layer-by-layer. We evaluate four open-weight end-to-end models on SpeechMMLU and VoiceBench BBH. Using cross-layer CKA analysis with speech-text token alignment, we find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech where semantic content spans multiple frames. We show that these alignment patterns are structurally stable across different analysis configurations. Crucially, simple statistical calibration is insufficient and can be detrimental when applied at the input layer, indicating that the modality gap is not a mere distribution shift. Overall, our results suggest that the bottleneck lies in condensing redundant speech into stable late-layer decisions, motivating future solutions that operate at the token or temporal granularity instead of feature-level matching.
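The cross-layer CKA analysis mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the linear variant of CKA (the paper may use a different kernel or a debiased estimator), and it assumes speech and text activations have already been aligned so that row i of each matrix corresponds to the same token. The names `linear_cka` and `cross_layer_cka` are hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two (n_samples, dim) activation matrices.

    Assumption: both matrices have the same number of rows, and row i in
    each refers to the same aligned token. Feature dimensions may differ.
    """
    # Center each feature so the comparison ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cross_layer_cka(speech_layers, text_layers) -> np.ndarray:
    """Heatmap with speech-input layers as rows and text-input layers as columns."""
    heatmap = np.zeros((len(speech_layers), len(text_layers)))
    for i, s in enumerate(speech_layers):
        for j, t in enumerate(text_layers):
            heatmap[i, j] = linear_cka(s, t)
    return heatmap
```

In such a heatmap, a "broad alignment band" would show up as many neighboring cells around the diagonal having similarly high values, rather than a single sharp ridge.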

Paper Structure

This paper contains 51 sections, 6 equations, 22 figures, and 7 tables.

Figures (22)

  • Figure 1: Accuracy with text input (T2T) versus speech input (S2T) across four models. The modality gap, measured as the accuracy drop from text input to speech input, is consistent across all models and both benchmarks.
  • Figure 2: Cross-modal cross-layer CKA heatmaps. Each heatmap compares S2T layers on the y-axis against T2T layers on the x-axis. The early dark zone indicates Phase I heterogeneous projection, the broad diagonal band reflects Phase II semantic smearing, and late-layer stagnation reveals Phase III decision instability.
  • Figure 3: Phase boundary visualization from CKA summaries. Black lines show best-match paths; point colors indicate row-wise peak alignment strength. Background shading marks the three processing phases. The late-layer plateau reveals where speech representations stall before reaching the text head.
  • Figure 4: Micro-evidence of Information Dilution (four representative cases). Each panel traces the decision-token attention back to the input span without any resampling. Across diverse questions, text attention consistently sharpens onto a small set of tokens, whereas speech attention spreads mass across many acoustic tokens. The rank-CDF in each panel quantifies this dispersion without requiring sequence-length alignment.
  • Figure 5: Cross-modal synchronization of layer-wise updates. The heatmap shows cosine similarity between speech and text update vectors at each layer. The clear diagonal pattern indicates that both modalities apply similar transformations at each layer, even though speech representations are more spread out due to redundancy.
  • ...and 17 more figures
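
Two of the quantities named in the captions above, the attention rank-CDF (Figure 4) and the cosine similarity between layer-wise updates (Figure 5), can be sketched as below. This is an illustrative reading of the captions under stated assumptions, not the paper's code: attention is taken as a 1-D vector of non-negative weights from the decision token over input positions, an "update vector" is assumed to be the difference between consecutive layers' hidden states, and all function names are hypothetical.

```python
import numpy as np


def attention_rank_cdf(weights) -> np.ndarray:
    """Cumulative attention mass after sorting positions from largest to smallest weight."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the curve ends at 1
    return np.cumsum(np.sort(w)[::-1])


def positions_for_mass(weights, mass: float = 0.9) -> int:
    """Number of top-ranked positions needed to cover `mass` of the attention."""
    cdf = attention_rank_cdf(weights)
    count = int(np.searchsorted(cdf, mass)) + 1
    return min(count, len(cdf))  # guard against rounding at mass close to 1


def update_cosine_matrix(speech_states: np.ndarray, text_states: np.ndarray) -> np.ndarray:
    """Cosine similarity between every speech-layer update and every text-layer update.

    Assumption: each input has shape (num_layers + 1, dim), one pooled hidden
    state per layer, and an update is the difference between adjacent layers.
    """
    ds = np.diff(speech_states, axis=0)
    dt = np.diff(text_states, axis=0)
    ds = ds / np.linalg.norm(ds, axis=1, keepdims=True)
    dt = dt / np.linalg.norm(dt, axis=1, keepdims=True)
    return ds @ dt.T
```

Because the rank-CDF sorts weights before accumulating, it can be compared between a short text sequence and a much longer speech sequence without resampling either one. Concentrated attention reaches a given mass within a few positions, while dispersed attention needs many.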