Are Audio-Language Models Listening? Audio-Specialist Heads for Adaptive Audio Steering

Neta Glazer, Lenny Aharon, Ethan Fetaya

TL;DR

This work uses mechanistic interpretability to identify a small set of audio-specialist attention heads whose audio attention yields a "listening" signal, and shows that this signal increases when audio evidence affects the model's output, providing an indicator of audio engagement under standard prompting.

Abstract

Multimodal large language models can exhibit text dominance, over-relying on linguistic priors instead of grounding predictions in non-text inputs. One example is large audio-language models (LALMs), which can under-utilize audio evidence even when it is decisive for the answer. To address this issue, we use mechanistic interpretability to identify a small set of audio-specialist attention heads whose audio attention yields a "listening" signal. We show that this signal increases when audio evidence affects the model's output, providing an indicator of audio engagement under standard prompting. Leveraging this localization, we construct an audio-silence steering direction and apply an inference-time activation intervention to the final representation, amplifying the effect of the audio on the model's prediction. To demonstrate the utility of this intervention, we show that it improves accuracy on MMAU by up to 8.0 percentage points on two Qwen-based LALMs, without any parameter updates.
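
As a rough illustration of the "listening" signal, the sketch below computes one head's audio-attention share as the fraction of its attention mass that falls on audio tokens, then ranks heads by a caller-supplied score. The abstract does not give the exact definition, so the normalization, the function names `audio_attention_share` and `top_k_heads`, and the `scores` dictionary (standing in for how predictive each head's share is of correctness on a calibration set) are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Head = Tuple[int, int]  # hypothetical (layer index, head index) identifier


def audio_attention_share(attn_row: Sequence[float], is_audio: Sequence[bool]) -> float:
    """Fraction of one head's attention mass that lands on audio tokens.

    attn_row: attention weights from a single query position over all keys.
    is_audio: True where the corresponding key position is an audio token.
    """
    if len(attn_row) != len(is_audio):
        raise ValueError("attn_row and is_audio must have the same length")
    total = sum(attn_row)
    if total <= 0.0:
        return 0.0  # no attention mass at all: treat as "not listening"
    audio_mass = sum(w for w, a in zip(attn_row, is_audio) if a)
    return audio_mass / total


def top_k_heads(scores: Dict[Head, float], k: int) -> List[Head]:
    """Return the k heads with the highest score (assumed: higher is better)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy usage: uniform attention over four keys, three of which are audio tokens.
share = audio_attention_share([0.25, 0.25, 0.25, 0.25], [False, True, True, True])
specialists = top_k_heads({(0, 1): 0.2, (3, 5): 0.9, (7, 2): 0.6}, k=2)
```

In a real model the attention row would be read from the attention weights at the position used for prediction; how the paper scores predictiveness is not stated in this summary, which is why the score is left as an input here.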

Paper Structure

This paper contains 10 sections, 9 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Specialist-Guided Steering (SGS). (a) We identify audio-specialist attention heads by computing each head’s audio-attention share and selecting the Top-$K$ heads whose audio attention is most predictive of correctness on a calibration set. (b) We run audio and matched-duration silence forward passes and form a layer-localized steering direction by aggregating residual differences $(\mathbf{h}^{\text{aud}}_\ell-\mathbf{h}^{\text{sil}}_\ell)$ over the specialist layer set $\mathcal{L}$ (layers containing the discovered heads). We scale this direction by $\beta$ and add it to the final representation to obtain $\mathbf{h}^*$ for prediction.
  • Figure 2: Effect of steering strength $\beta$ and specialist count $K$ on performance for R1-AQA (top) and Qwen2-Audio-7B (bottom). Lines show improvement in percentage points (pp) for different Top-$K$ specialist head sets; each $K$ induces a specialist layer set $\mathcal{L}$ and we apply layer-localized steering within $\mathcal{L}$.
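
The steering step described in Figure 1(b) can be sketched in a few lines. The caption says the residual differences are aggregated over $\mathcal{L}$ but does not say how, so this sketch assumes a plain mean; the function names, the dictionary-of-lists representation of per-layer residuals, and the toy numbers are likewise illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, Iterable, List


def steering_direction(
    h_aud: Dict[int, List[float]],
    h_sil: Dict[int, List[float]],
    layers: Iterable[int],
) -> List[float]:
    """Aggregate (h_aud[l] - h_sil[l]) over the specialist layers.

    ASSUMPTION: the aggregation is an unweighted mean over the layer set.
    """
    layer_list = list(layers)
    if not layer_list:
        raise ValueError("the specialist layer set must not be empty")
    dim = len(h_aud[layer_list[0]])
    direction = [0.0] * dim
    for layer in layer_list:
        for i in range(dim):
            direction[i] += h_aud[layer][i] - h_sil[layer][i]
    return [d / len(layer_list) for d in direction]


def apply_steering(h_final: List[float], direction: List[float], beta: float) -> List[float]:
    """Return h* = h_final + beta * direction."""
    return [h + beta * d for h, d in zip(h_final, direction)]


# Toy residuals for two specialist layers (2 and 5) with hidden size 2.
h_aud = {2: [1.0, 2.0], 5: [3.0, 0.0]}  # audio forward pass
h_sil = {2: [0.0, 1.0], 5: [1.0, 0.0]}  # matched-duration silence forward pass
direction = steering_direction(h_aud, h_sil, layers=[2, 5])
h_star = apply_steering([0.0, 0.0], direction, beta=2.0)
```

Setting `beta` to 0 leaves the final representation unchanged, which gives a convenient baseline when sweeping the steering strength as in Figure 2.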