Causal Tracing of Audio-Text Fusion in Large Audio Language Models

Wei-Chih Chen; Chien-yu Huang; Hung-yi Lee

Abstract

Despite the strong performance of large audio language models (LALMs) in various tasks, exactly how and where they integrate acoustic features with textual context remains unclear. We adapt causal tracing to investigate the internal information flow of LALMs during audio comprehension. By conducting layer-wise and token-wise analyses across DeSTA, Qwen, and Voxtral, we evaluate the causal effects of individual hidden states. Layer-wise analysis identifies different fusion strategies, from progressive integration in DeSTA to abrupt late-stage fusion in Qwen. Token-wise analysis shows that the final sequence token acts as an informational bottleneck where the network decisively retrieves relevant information from the audio. We also observe an attention-like query mechanism at intermediate token positions that triggers the model to pull task-relevant audio context. These findings provide a clear characterization of when and where multi-modal integration occurs within LALMs.
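
To make the tracing procedure concrete, the following is a minimal sketch of generic causal tracing (activation patching) on a toy network. It is not the authors' implementation and does not use a real LALM: the scalar hidden states, the `layer` mixing rule, and the names `run` and `causal_trace` are all assumptions chosen for illustration. The idea is to record hidden states from a clean run, rerun the model on a corrupted input, restore one clean hidden state at a given (layer, token) position, and measure how much of the clean output is recovered.

```python
from typing import List, Optional, Tuple

State = List[float]  # one scalar hidden state per token (toy stand-in for a vector)


def layer(h: State) -> State:
    """Hypothetical causal layer: each token adds half the mean of itself and earlier tokens."""
    return [h[t] + 0.5 * sum(h[: t + 1]) / (t + 1) for t in range(len(h))]


def run(
    inputs: State,
    n_layers: int,
    patch: Optional[Tuple[int, int, float]] = None,
) -> Tuple[float, List[State]]:
    """Run the toy model, optionally overwriting one hidden state.

    patch is (layer_index, token_index, value). states[l] holds the hidden
    states entering layer l, and states[n_layers] holds the final states.
    The model output is the final state of the last token.
    """
    h = list(inputs)
    states: List[State] = []
    for l in range(n_layers + 1):
        if patch is not None and patch[0] == l:
            h[patch[1]] = patch[2]  # restore a single clean hidden state
        states.append(list(h))
        if l < n_layers:
            h = layer(h)
    return states[-1][-1], states


def causal_trace(clean: State, corrupted: State, n_layers: int) -> List[List[float]]:
    """Return effects[l][t]: fraction of the clean-vs-corrupted output gap
    recovered by restoring the clean hidden state at layer l, token t."""
    clean_out, clean_states = run(clean, n_layers)
    corrupt_out, _ = run(corrupted, n_layers)
    gap = clean_out - corrupt_out  # assumed non-zero for a meaningful trace
    effects: List[List[float]] = []
    for l in range(n_layers + 1):
        row = []
        for t in range(len(clean)):
            patched_out, _ = run(corrupted, n_layers, patch=(l, t, clean_states[l][t]))
            row.append((patched_out - corrupt_out) / gap)
        effects.append(row)
    return effects


if __name__ == "__main__":
    # First two positions play the role of "audio" tokens, the rest "text" tokens.
    clean_input = [1.0, 2.0, 0.0, 0.0]
    corrupted_input = [0.0, 0.0, 0.0, 0.0]  # audio information removed
    for l, row in enumerate(causal_trace(clean_input, corrupted_input, n_layers=3)):
        print(l, [round(e, 3) for e in row])
```

Reading the resulting grid row by row corresponds to a layer-wise view, and reading it column by column corresponds to a token-wise view. In this toy setup the output is read from the last token, so restoring any other position at the final layer has no effect; that is a property of the sketch, not evidence about any real model.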

Paper Structure (13 sections, 1 equation, 1 figure)

Method
Setup
Layer-wise Tracing
Token-wise Tracing
Experimental Settings
Dataset
Models
Results
Depth of Multi-Modal Integration
Spatial Localization of Audio-Text Fusion
Implications for LALM Design
Conclusion
Generative AI Usage Disclosure

Figures (1)

Figure 3: Token-wise tracing results across all models and four auditory attributes reveal the last token as a critical informational bottleneck for audio context retrieval, while object tokens trigger an attention-like query mechanism to extract specific target attributes.
