Investigating Modality Contribution in Audio LLMs for Music

Giovana Morais, Magdalena Fuentes

TL;DR

This paper asks whether Audio LLMs truly listen to the audio or rely on textual reasoning, and adapts MM-SHAP to quantify each modality's contribution. It builds a Shapley-value-based framework that masks text tokens and audio waveform segments, applies PermutationSHAP with $m=10$ samples to approximate token-level contributions, and defines the modality scores $A$-SHAP and $T$-SHAP. Evaluating two Audio LLMs (Qwen-Audio and MU-LLaMA) on the MuChoMusic benchmark, the study finds that the more accurate model relies more on text; even so, the models can localize key sound events in the audio, indicating that audio is not entirely ignored. The work is the first application of MM-SHAP to Audio LLMs and offers a foundation for explainable-AI analyses of audio-based reasoning and multimodal integration.

Abstract

Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear if they are truly listening to the audio or just using textual reasoning, as recent benchmarks suggest. This paper investigates this issue by quantifying the contribution of each modality to a model's output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that quantifies the relative contribution of each modality to a model's prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions, but further inspection shows that even if the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs and we hope it will serve as a foundational step for future research in explainable AI and audio.

Paper Structure

This paper contains 11 sections, 3 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: We compute Shapley values by masking all combinations of inputs (approximated via random permutations) and averaging the change in the logits of the base answer (the unmasked inference, indicated by the solid line). We mask text tokens and audio waveform segments.
  • Figure 2: QwenAudio MC-NPI results for MusicCaps track QK-mjNg8cPo: the output is "The sound effect that can be heard in the piece is a bell sound effect". The image shows the modality contribution for output token "bell", i.e., $\Phi_{A,t}$ and $\Phi_{T,t}$, where $t = \text{bell}$. The top section highlights the most important text tokens from the input question (black = highest contribution, with a threshold of 80% of the maximum Shapley value for readability). The bottom section shows the audio waveform with its corresponding Shapley value contributions: absolute value, positive, and negative components. Darker colors mean higher contribution to the model's output token.
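
The masking procedure described in Figure 1 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation and not the SHAP library's `PermutationExplainer`: it uses plain random permutations (10 by default, echoing the $m=10$ mentioned in the TL;DR), and the names `permutation_shapley`, `modality_scores`, and `toy_logit` are hypothetical. The aggregation into modality scores assumes the usual MM-SHAP convention of summing absolute Shapley values per modality and normalizing by the total; the text above does not spell out that formula. The toy "model" is an additive function standing in for the logit of the base answer.

```python
import random
from typing import Callable, FrozenSet, List, Sequence, Tuple


def permutation_shapley(
    value_fn: Callable[[FrozenSet[int]], float],
    n_features: int,
    n_permutations: int = 10,
    seed: int = 0,
) -> List[float]:
    """Monte Carlo Shapley estimate from random feature orderings.

    value_fn receives the set of UNMASKED feature indices and returns a
    scalar (here a stand-in for the logit of the base answer).
    """
    rng = random.Random(seed)
    phi = [0.0] * n_features
    for _ in range(n_permutations):
        order = list(range(n_features))
        rng.shuffle(order)
        visible: set = set()
        prev = value_fn(frozenset(visible))  # everything masked
        for i in order:
            visible.add(i)  # unmask feature i
            cur = value_fn(frozenset(visible))
            phi[i] += cur - prev  # marginal contribution of feature i
            prev = cur
    return [p / n_permutations for p in phi]


def modality_scores(
    phi: Sequence[float], audio_idx: Sequence[int]
) -> Tuple[float, float]:
    """Return (A-SHAP, T-SHAP) as shares of total absolute contribution.

    Assumption: every feature not listed in audio_idx is a text token.
    """
    total = sum(abs(p) for p in phi)
    if total == 0:
        raise ValueError("all Shapley values are zero; shares are undefined")
    audio = sum(abs(phi[i]) for i in audio_idx)
    return audio / total, (total - audio) / total


# Toy setup (hypothetical): features 0-1 are text tokens, 2-4 are audio segments.
WEIGHTS = [2.0, 1.0, 0.5, -0.25, 0.25]


def toy_logit(unmasked: FrozenSet[int]) -> float:
    # Additive stand-in for a model: each unmasked feature shifts the logit.
    return sum(WEIGHTS[i] for i in unmasked)


phi = permutation_shapley(toy_logit, n_features=len(WEIGHTS))
a_shap, t_shap = modality_scores(phi, audio_idx=[2, 3, 4])
print(phi, a_shap, t_shap)
```

Because the toy function is additive, every permutation yields the same marginal contributions, so the estimate matches the weights exactly; with a real model the values depend on the sampled orderings and on how masked inputs are represented. Taking absolute values before aggregating means a segment that pushes the logit down still counts as contributing, which is why the score is independent of whether the answer is correct.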