AHAMask: Reliable Task Specification for Large Audio Language Models without Instructions

Yiwei Guo, Bohan Li, Hankun Wang, Zhihan Li, Shuai Wang, Xie Chen, Kai Yu

TL;DR

The paper tackles prompt sensitivity in large audio language models by introducing AHAMask, a method that learns a binary mask over the decoder's attention heads to elicit task-specific acoustic functionalities without instructions. The approach is highly parameter-efficient: it trains only the mask, which has as many parameters as the backbone has attention heads. Experiments show that AHAMask can match or exceed instruction-based performance on single tasks and improve reliability on composite tasks. The results also provide evidence for acoustic functional pathways within LALMs, offering a new lens for interpretability and modular control. Practically, AHAMask enables robust, instruction-free task specification in multi-modal LALMs, with potential benefits for reliability and deployment in audio-centric AI systems.

Abstract

Although current large audio language models (LALMs) extend text large language models (LLMs) with generic acoustic understanding abilities, they usually suffer from prompt sensitivity, where different instructions of the same intention can yield drastically different outcomes. In this work, we propose AHAMask, where we simply mask some of the attention heads in the decoder-only LLM backbone of LALMs, to trigger specific acoustic task functionalities without instructions. These masks are efficiently obtained by training on an LALM, with the number of trainable parameters equal to the attention head count in its LLM backbone. We show by experiments that applying such selective attention head masks achieves comparable or even better performance than using instructions, either on single or composite tasks. Besides achieving reliable acoustic task specification for LALMs, this also reveals that LALMs exhibit certain "functional pathways" in their attention heads.
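The masking idea described in the abstract can be illustrated with a minimal sketch. It assumes that each attention head's output is multiplied by a 0/1 gate before the heads are concatenated, and that the gate is obtained by thresholding one trainable logit per head at zero. The function names, the thresholding rule, and the tensor layout are all assumptions made for illustration; the summary above does not specify how the binary mask is parameterized or trained.

```python
import numpy as np

def apply_head_mask(head_outputs, mask_logits):
    """Zero out the outputs of masked attention heads in one layer.

    head_outputs: array of shape (num_heads, seq_len, head_dim)
    mask_logits:  array of shape (num_heads,), one trainable value per head
                  (hypothetical parameterization)
    Returns an array of shape (seq_len, num_heads * head_dim).
    """
    # Assumption: a head is active when its logit is positive.
    binary_mask = (mask_logits > 0).astype(head_outputs.dtype)
    # Broadcast the per-head gate over the sequence and feature axes.
    masked = head_outputs * binary_mask[:, None, None]
    # Concatenate heads along the feature axis, as in standard multi-head attention.
    seq_len = head_outputs.shape[1]
    return masked.transpose(1, 0, 2).reshape(seq_len, -1)

def count_mask_parameters(num_layers, heads_per_layer):
    """One mask parameter per attention head in the LLM backbone."""
    return num_layers * heads_per_layer

# Toy usage: 4 heads, 3 tokens, 2 features per head; heads 1 and 3 are masked.
toy_heads = np.ones((4, 3, 2))
toy_logits = np.array([1.0, -1.0, 0.5, -2.0])
toy_output = apply_head_mask(toy_heads, toy_logits)
```

For a hypothetical backbone with 32 layers of 32 heads each, `count_mask_parameters(32, 32)` gives 1024 trainable values, which shows how small the mask is relative to the model itself.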

Paper Structure

This paper contains 27 sections, 4 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Diagram of a typical large audio language model. Left: the original model, which requires a text instruction to perform a specific task, but is sensitive to instructions. Right: the large audio language model with AHAMask (acoustic attention head mask), where only a subset of attention heads is activated for a specific task. This removes the need for text instructions.
  • Figure 2: Prompt sensitivity experiments on SALMONN. Different colors and columns denote different types of variations.
  • Figure 3: Jaccard similarities of AHAMask between different tasks in each LALM.
  • Figure 4: Performance of SALMONN on 4 non-classification tasks with AHAMask at different percentages of activated attention heads. The orange triangle marker denotes the metric in Table \ref{tab:single-salmonn}.
  • Figure 5: The effect of the penalty coefficient $\lambda$ on the number of activated attention heads on SALMONN. Larger $\lambda$ leads to fewer activated heads (green solid lines), but not necessarily worse performance (red dashed lines). Note that the difference in performance caused by different $\lambda$ is usually small, i.e. the right y-axis usually has a small range.
  • ...and 1 more figures
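
The Jaccard similarity used in Figure 3 to compare the masks of different tasks is the size of the intersection of two sets divided by the size of their union. A minimal sketch for binary head masks follows. It assumes each mask is flattened over all layers and heads, with 1 marking an activated head; the function name and the convention for two empty masks are assumptions, not details taken from the paper.

```python
import numpy as np

def jaccard_similarity(mask_a, mask_b):
    """Jaccard similarity between two binary attention head masks.

    Both masks are flattened, so inputs of shape (num_layers, num_heads)
    or (num_layers * num_heads,) are accepted.
    """
    a = np.asarray(mask_a, dtype=bool).ravel()
    b = np.asarray(mask_b, dtype=bool).ravel()
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Assumed convention: two masks with no active heads are identical.
        return 1.0
    intersection = np.logical_and(a, b).sum()
    return float(intersection / union)

# Toy usage: the masks share one active head out of three active in total.
similarity = jaccard_similarity([1, 1, 0, 0], [1, 0, 1, 0])
```

A value near 1 means two tasks activate almost the same heads, while a value near 0 means their activated heads barely overlap.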