Do Audio-Visual Large Language Models Really See and Hear?

Ramaneswaran Selvakumar, Kaousheik Jayakumar, S Sakshi, Sreyan Ghosh, Ruohan Gao, Dinesh Manocha

Abstract

Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse through different layers of an AVLLM to produce the final text outputs. We find that although AVLLMs encode rich audio semantics at intermediate layers, these capabilities largely fail to surface in the final text generation when audio conflicts with vision. Probing analyses show that useful latent audio information is present, but deeper fusion layers disproportionately privilege visual representations that tend to suppress audio cues. We further trace this imbalance to training: the AVLLM's audio behavior strongly matches its vision-language base model, indicating limited additional alignment to audio supervision. Our findings reveal a fundamental modality bias in AVLLMs and provide new mechanistic insights into how multimodal LLMs integrate audio and vision.

Paper Structure

This paper contains 26 sections, 6 equations, 15 figures, and 1 algorithm.

Figures (15)

  • Figure 1: Illustration of visual bias. AVLLMs exhibit a critical modality bias, often prioritizing visual cues over vital audio cues. The diagram illustrates a counterfactual scene in which the visible objects (a blue car and a woman walking a dog) are silent and the only audible sound is an out-of-view ambulance siren. When prompted to describe the scene, the AVLLM hallucinates audio events (car engine, dog barking) and misses the actual siren sound.
  • Figure 2: Audio Understanding Performance. Audio understanding severely degrades under audio-visual conflict.
  • Figure 3: Caption evaluation. We use an open-source reasoning LLM to evaluate audio-visual captions by assessing temporal sequences, object attributes, and cross-modal relationships. This approach is interpretable (it gives explicit reasoning for scores) and flexible, as we can calibrate it using in-context examples.
  • Figure 4: Mean attention from generated to input tokens. Generated tokens allocate high attention to audio in early layers (40-50% in layers 0-5), which drops to near-zero afterward. Video attention steadily increases through deeper layers, reaching 20-40% in layers 15-30. (A sketch of this measurement appears after the figure list.)
  • Figure 5: Probing Audio Representations. We decode intermediate-layer audio representations using the base LLM's unembedding matrix and observe that they decode into meaningful concepts describing sound events and their sources, in multiple languages (e.g., 键盘/keyboard, typing). (A sketch of this probe appears after the figure list.)
  • ...and 10 more figures
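
The per-layer attention breakdown described in Figure 4 can be reproduced with a few lines of tensor arithmetic. The sketch below is a minimal, hypothetical illustration rather than the paper's released code: it assumes a HuggingFace-style decoder run with `output_attentions=True`, and that the index ranges of audio tokens, video tokens, and generated tokens in the packed input sequence are known (`audio_span`, `video_span`, and `gen_positions` are placeholder names).

```python
# Minimal sketch of the Figure 4 measurement: how much attention mass the
# generated tokens place on audio vs. video input tokens at each layer.
# `attentions` is assumed to be a tuple of per-layer [batch, heads, seq, seq]
# tensors; the modality index ranges are assumed known for this sample.
import torch

def modality_attention_by_layer(attentions, gen_positions, audio_span, video_span):
    audio_share, video_share = [], []
    for layer_attn in attentions:
        attn = layer_attn[0].mean(dim=0)           # average over heads -> [seq, seq]
        rows = attn[gen_positions]                 # attention rows of generated tokens
        audio_share.append(rows[:, audio_span].sum(dim=-1).mean().item())
        video_share.append(rows[:, video_span].sum(dim=-1).mean().item())
    return audio_share, video_share                # one value per layer
```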
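The probing in Figure 5 is a logit-lens-style readout: intermediate hidden states at audio token positions are projected through the base LLM's final layer norm and unembedding matrix and mapped back to vocabulary items. The following is a hedged sketch under the same assumptions as above (placeholder names, not the authors' implementation):

```python
# Logit-lens-style probe: decode intermediate-layer hidden states at the
# audio token positions into nearest vocabulary tokens via the base LLM's
# unembedding matrix (lm_head), as in Figure 5.
import torch

@torch.no_grad()
def decode_audio_hiddens(hidden_states, audio_positions, final_norm, lm_head,
                         tokenizer, layer, top_k=5):
    # hidden_states: tuple of per-layer [batch, seq, dim] tensors
    h = hidden_states[layer][0, audio_positions]   # [n_audio, dim]
    logits = lm_head(final_norm(h))                # [n_audio, vocab]
    top_ids = logits.topk(top_k, dim=-1).indices   # [n_audio, top_k]
    return [tokenizer.convert_ids_to_tokens(ids.tolist()) for ids in top_ids]
```

Reading off the top-k tokens per audio position at successive layers is what surfaces the multilingual sound-event concepts shown in the figure.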