AR&D: A Framework for Retrieving and Describing Concepts for Interpreting AudioLLMs

Townim Faisal Chowdhury, Ta Duc Huy, Siqi Pan, Jeremy Stoddard, Zhibin Liao

TL;DR

This work introduces the first mechanistic interpretability framework for AudioLLMs. It uses sparse autoencoders (SAEs) to disentangle polysemantic activations into monosemantic features, providing a foundation for trustworthy deployment in high-stakes domains.

Abstract

Despite strong performance in audio perception tasks, large audio-language models (AudioLLMs) remain opaque to interpretation. A major factor behind this lack of interpretability is that individual neurons in these models frequently activate in response to several unrelated concepts. We introduce the first mechanistic interpretability framework for AudioLLMs, leveraging sparse autoencoders (SAEs) to disentangle polysemantic activations into monosemantic features. Our pipeline identifies representative audio clips, assigns meaningful names via automated captioning, and validates concepts through human evaluation and steering. Experiments show that AudioLLMs encode structured and interpretable features, enhancing transparency and control. This work provides a foundation for trustworthy deployment in high-stakes domains and enables future extensions to larger models, multilingual audio, and more fine-grained paralinguistic features. Project URL: https://townim-faisal.github.io/AutoInterpret-AudioLLM/

Paper Structure (8 sections, 4 equations, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Overview of the Audio Retrieve and Describe (AR&D) pipeline for discovering and naming interpretable concepts in AudioLLMs. Stage 1: The SAE is trained to reconstruct representations $\mathbf{x}$ from the AudioLLM, yielding a latent space of sparse, monosemantic features. Stage 2: Using a probing dataset $\mathcal{A}$, we compute SAE activations $\mathbf{Z}$ and calculate a representativeness score with $F(\cdot)$ for each feature, selecting the $p$ most and $p$ least representative audio clips ($H^k$ and $L^k$) per feature. Stage 3: We filter the top features using monosemanticity scores derived from $H^k$ and $L^k$, and interpret them by generating and summarizing captions for the representative clips $H^k$, producing a final set of human-understandable concepts.
  • Figure 2: Illustration of the steering mechanism. The activation $\mathbf{x}$ obtained from a given layer of the AudioLLM is transformed into an SAE representation $\mathbf{z}$. The $k$-th feature (i.e., the targeted concept) is then replaced with a predefined steering value (e.g., raised from 2.5 to 4.0 in the example). The modified representation $\hat{\mathbf{z}}$ is processed by a TopK operator and decoded into $\hat{\mathbf{x}}$ (following the SAE equation), which replaces $\mathbf{x}$ and is fed through the remaining AudioLLM layers (marked by the dotted lines), allowing fine-grained control over specific features in the model.
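
The clip selection in Figure 1 (Stage 2) and the steering step in Figure 2 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation. It assumes a linear SAE encoder and decoder with hypothetical parameters `W_enc`, `b_enc`, `W_dec`, `b_dec`, and it assumes the representativeness score $F(\cdot)$ is simply the raw activation of the feature, which the captions do not specify. The function names `topk`, `select_clips`, and `steer` are likewise made up for illustration.

```python
import numpy as np

def topk(z, k):
    """Keep the k largest entries along the last axis of z and zero the rest."""
    out = np.zeros_like(z)
    idx = np.argpartition(z, -k, axis=-1)[..., -k:]
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=-1), axis=-1)
    return out

def select_clips(Z, feature, p):
    """Return (H, L): indices of the p highest- and p lowest-scoring clips.

    Z has shape (num_clips, num_features). Assumption: the score F is the
    raw SAE activation Z[:, feature]; the paper's F may differ.
    """
    order = np.argsort(Z[:, feature])   # ascending by score
    high = order[-p:][::-1]             # most representative first
    low = order[:p]                     # least representative first
    return high, low

def steer(x, W_enc, b_enc, W_dec, b_dec, feature, value, k):
    """Overwrite one SAE feature, apply TopK, and decode back to x_hat.

    Assumption: a linear encoder/decoder pair; the order of operations
    (encode, replace feature, TopK, decode) follows the Figure 2 caption.
    """
    z = (x - b_dec) @ W_enc + b_enc     # SAE representation before TopK
    z[..., feature] = value             # replace the targeted concept
    z_hat = topk(z, k)                  # enforce sparsity
    return z_hat @ W_dec + b_dec        # x_hat, fed to the remaining layers

# Toy usage with identity weights so the effect is easy to follow.
d = 4
x = np.array([1.0, 2.5, 0.5, 3.0])
x_hat = steer(x, np.eye(d), np.zeros(d), np.eye(d), np.zeros(d),
              feature=1, value=4.0, k=2)
```

With identity weights, the toy call raises feature 1 from 2.5 to 4.0 and keeps only the two largest features, so `x_hat` differs from `x` both in the steered entry and in the entries that TopK zeroes out. In a real setup, `x_hat` would replace the layer activation before the forward pass continues.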