CALM: Class-Conditional Sparse Attention Vectors for Large Audio-Language Models

Videet Mehta, Liming Wang, Hilde Kuehne, Rogerio Feris, James R. Glass, M. Jehanzeb Mirza

TL;DR

Frozen large audio–language models (LALMs) underperform specialized models on discriminative tasks. This work extends Sparse Attention Vectors with CALM, a class-conditional head-weighting scheme that learns per-class head reliabilities from few-shot data and uses them for weighted voting across selected attention heads, without fine-tuning. CALM consistently improves over uniform head voting on audio and audio–visual benchmarks (e.g., gains of up to 14.52% on AudioSet) and supports a self-supervised alternative via pseudo-labeling. The analysis reveals clear head specialization in later LALM layers, and the method offers a practical, fine-tuning-free way to repurpose LALMs for discriminative tasks with strong performance and scalability benefits.

Abstract

Large audio-language models (LALMs) exhibit strong zero-shot capabilities in multiple downstream tasks, such as audio question answering (AQA) and abstract reasoning; however, these models still lag behind specialized models for certain discriminative tasks (e.g., audio classification). Recent studies show that sparse subsets of attention heads within an LALM can serve as strong discriminative feature extractors for downstream tasks such as classification via simple voting schemes. However, these methods assign uniform weights to all selected heads, implicitly assuming that each head contributes equally across all semantic categories. In this work, we propose Class-Conditional Sparse Attention Vectors for Large Audio-Language Models, a few-shot classification method that learns class-dependent importance weights over attention heads. This formulation allows individual heads to specialize in distinct semantic categories and to contribute to ensemble predictions in proportion to their estimated reliability. Experiments on multiple few-shot audio and audio-visual classification benchmarks and tasks demonstrate that our method consistently outperforms state-of-the-art uniform voting-based approaches, with absolute gains of up to 14.52%, 1.53%, and 8.35% for audio classification, audio-visual classification, and spoofing detection, respectively.
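
The contrast between uniform and class-conditional voting can be written down compactly. The notation below is an assumption made for illustration and is not taken from the paper's own equations: $s_{h,c}(x)$ denotes the similarity score that head $h$ assigns to class $c$ for input $x$, $\mathcal{H}$ is the set of selected heads, and $w_{h,c} \ge 0$ is a learned per-class head weight.

```latex
% Uniform voting: every selected head casts one equally weighted vote
% for its own top-scoring class.
\hat{y}_{\mathrm{uniform}}(x)
  = \arg\max_{c} \sum_{h \in \mathcal{H}}
    \mathbb{1}\!\left[\, c = \arg\max_{c'} s_{h,c'}(x) \right]

% Class-conditional weighted voting: head h contributes to class c
% in proportion to its estimated reliability w_{h,c} for that class.
\hat{y}_{\mathrm{weighted}}(x)
  = \arg\max_{c} \sum_{h \in \mathcal{H}} w_{h,c}\, s_{h,c}(x)
```

Under this reading, a head that is reliable for one class but not for another can carry a large $w_{h,c}$ for the first and a small or zero weight for the second, which is the kind of specialization a single uniform weight per head cannot express.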

Paper Structure (26 sections, 12 equations, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: Fine-tuning–free audio classification with CALM. Given a frozen audio–language model, we extract per-head last-token hidden-state representations and compute class centroids. Each attention head produces cosine-similarity scores to all class centroids. For every head–class pair, we estimate a reliability score based on the margin to the next most confident class, which is visualized by the arrow width and color in the figure. Highly reliable heads contribute strongly (thick, saturated arrows), while unreliable heads are suppressed by lower weights. At inference time, class predictions are obtained via reliability-weighted soft voting across all heads. This yields accurate classification without task-specific fine-tuning.
  • Figure 2: Overview of CALM. We extract attention-head vectors from a frozen LALM given an audio input and a prompt. We then build class centroids from few-shot training examples and estimate a class-conditional head reliability matrix using margin-based confidence, sparsified to $k$ head experts per class. At inference, CALM combines centroid similarities with these reliability weights to perform weighted voting over heads, enabling fine-tuning-free classification.
  • Figure 3: Per-class weight survival functions for VGGSound. CALM learns class-specific sparse weightings, with different classes concentrating weight onto different subsets of attention heads. Dashed lines show uniform-voting baselines for $k = 100, 300, 500,$ and $1024$ heads.
  • Figure 4: Class-specific vs. global head weighting. Locally weighted (LW) classification consistently outperforms global weighting (GW), particularly when more heads are available, indicating class-level specialization of attention heads.
  • Figure 5: Effect of training shots and head sparsity. Performance improves with more shots and peaks at approximately 30–50% head sparsity, especially on simpler classification tasks.
  • ...and 1 more figure
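
The pipeline described in the captions of Figures 1 and 2 (per-head class centroids, cosine-similarity scores, margin-based reliability, sparsification to $k$ heads per class, weighted voting) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-head vectors are assumed to be already extracted from the frozen model, and the exact reliability formula (mean margin clipped at zero, top-$k$ per class, normalized per class) as well as all function names and the synthetic data are hypothetical choices made here.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def class_centroids(feats, labels, num_classes):
    """feats: (N, H, D) per-head vectors -> unit-norm centroids (H, C, D)."""
    cents = np.stack(
        [feats[labels == c].mean(axis=0) for c in range(num_classes)], axis=1
    )
    return l2_normalize(cents)

def head_scores(feats, cents):
    """Cosine similarity of every head vector to every class centroid: (N, H, C)."""
    return np.einsum("nhd,hcd->nhc", l2_normalize(feats), cents)

def reliability(scores, labels, num_classes, k):
    """Assumed reliability: mean margin between the true-class score and the
    best rival class, clipped at 0, keeping the top-k heads per class."""
    n = scores.shape[0]
    num_heads = scores.shape[1]
    onehot = np.eye(num_classes, dtype=bool)[labels]              # (N, C)
    true = scores[np.arange(n), :, labels]                        # (N, H)
    rival = np.where(onehot[:, None, :], -np.inf, scores).max(axis=2)
    margin = true - rival                                         # (N, H)
    W = np.zeros((num_heads, num_classes))
    for c in range(num_classes):
        m = np.clip(margin[labels == c].mean(axis=0), 0.0, None)  # (H,)
        top = np.argsort(m)[-k:]                                  # k best heads
        W[top, c] = m[top]
    return W / (W.sum(axis=0, keepdims=True) + 1e-8)              # per-class norm

def predict(scores, W):
    """Reliability-weighted soft voting over heads."""
    return np.einsum("nhc,hc->nc", scores, W).argmax(axis=1)

# Synthetic stand-in for extracted head vectors (hypothetical data):
# heads 0-3 carry a clean class signal, heads 4-7 are mostly noise.
rng = np.random.default_rng(0)
H, D, C, shots, k = 8, 16, 3, 5, 4
protos = rng.normal(size=(H, C, D))
noise = np.array([0.1] * 4 + [5.0] * 4)[None, :, None]

def sample(labels):
    clean = protos[:, labels, :].transpose(1, 0, 2)               # (N, H, D)
    return clean + noise * rng.normal(size=clean.shape)

train_y = np.repeat(np.arange(C), shots)
test_y = np.repeat(np.arange(C), 10)
train_x, test_x = sample(train_y), sample(test_y)

cents = class_centroids(train_x, train_y, C)
W = reliability(head_scores(train_x, cents), train_y, C, k)
preds = predict(head_scores(test_x, cents), W)
print("accuracy:", (preds == test_y).mean())
```

Replacing `W` with a constant matrix gives a uniform soft-voting baseline over all heads, which makes it easy to compare the two schemes on the same scores.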