AHAMask: Reliable Task Specification for Large Audio Language Models without Instructions
Yiwei Guo, Bohan Li, Hankun Wang, Zhihan Li, Shuai Wang, Xie Chen, Kai Yu
TL;DR
The paper tackles prompt sensitivity in large audio language models (LALMs) by introducing AHAMask, a method that learns a binary mask over the attention heads of the decoder-only LLM backbone to elicit task-specific acoustic functionalities without instructions. The approach is highly parameter-efficient: the only trainable parameters are the mask entries, one per attention head. Experiments show that AHAMask matches or exceeds instruction-based performance on single tasks and improves reliability on composite tasks. The results also provide evidence for acoustic "functional pathways" within LALMs, offering a new lens for interpretability and modular control. Practically, AHAMask enables robust, instruction-free task specification in LALMs, with potential benefits for reliability and deployment in audio-centric AI systems.
Abstract
Although current large audio language models (LALMs) extend text large language models (LLMs) with generic acoustic understanding abilities, they usually suffer from prompt sensitivity, where different instructions with the same intent can yield drastically different outcomes. In this work, we propose AHAMask, where we simply mask some of the attention heads in the decoder-only LLM backbone of an LALM to trigger specific acoustic task functionalities without instructions. These masks are efficiently obtained by training on an LALM, with the number of trainable parameters equal to the attention head count in its LLM backbone. We show by experiments that applying such selective attention head masks achieves comparable or even better performance than using instructions, on both single and composite tasks. Besides achieving reliable acoustic task specification for LALMs, this also reveals that LALMs exhibit certain "functional pathways" in their attention heads.
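The masking idea described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the logit-plus-threshold parameterization, and the toy sizes are all assumptions made for illustration, and the abstract does not say how the binary mask is optimized. The sketch only shows the two properties the abstract states: heads are switched off by a binary mask, and the trainable parameter count equals the number of attention heads.

```python
# Hypothetical sketch of attention head masking; all names are illustrative.

def init_mask_logits(num_layers, num_heads, init=3.0):
    """One trainable scalar per attention head (assumed parameterization)."""
    return [[init] * num_heads for _ in range(num_layers)]


def binarize(logits, threshold=0.0):
    """Hard 0/1 mask: keep a head only if its logit exceeds the threshold."""
    return [[1 if x > threshold else 0 for x in layer] for layer in logits]


def apply_head_mask(head_outputs, layer_mask):
    """Zero the output vector of every masked head in one layer.

    head_outputs: one list of floats per head, taken before the heads are
    concatenated and passed through the output projection.
    """
    return [[m * v for v in vec] for vec, m in zip(head_outputs, layer_mask)]


def count_trainable(logits):
    """Trainable parameters = total number of attention heads."""
    return sum(len(layer) for layer in logits)


# Toy backbone with 2 layers of 4 heads (sizes chosen only for the example).
logits = init_mask_logits(num_layers=2, num_heads=4)
logits[0][1] = -1.0  # pretend training pushed this head's logit below zero
mask = binarize(logits)
masked = apply_head_mask(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], mask[0]
)
```

In this sketch a different task would correspond to a different set of logits over the same frozen backbone, so switching tasks means swapping one small mask rather than rewording an instruction.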
