AAT: Adapting Audio Transformer for Various Acoustics Recognition Tasks
Yun Liang, Hai Lin, Shaojian Qiu, Yihang Zhang
TL;DR
This work tackles the problem of transferring pre-trained audio Transformers to diverse downstream acoustic tasks without sacrificing generalization or incurring prohibitive fine-tuning costs. The authors propose AAT, which freezes the backbone and introduces two adapters: a bottleneck MLP Adapter placed in parallel with the MLP, and a Spatial Adapter placed after multi-head self-attention (MHSA). Together these enable task-specific adaptation while preserving the pre-trained representations. Experiments on six datasets spanning event, speech, and music tasks show that AAT matches or surpasses full fine-tuning while updating only 7.118% of the parameters and training faster per epoch. The study demonstrates the practical value of parameter-efficient fine-tuning (PEFT) in acoustics recognition and highlights the conditions under which SSL versus SL pre-training benefits AAT.
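The parallel placement of the bottleneck MLP Adapter can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration and is not taken from the paper: the class and function names, the ReLU non-linearity, the scale factor, the tiny dimensions, and the zero-initialized up-projection (a common choice that makes the adapted block start out identical to the frozen one). The Spatial Adapter is not sketched because its internal structure is not described in this summary.

```python
import random


def linear(x, weight, bias):
    """Compute y = W x + b for one vector x; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class FrozenMLP:
    """Stand-in for the pre-trained MLP sub-layer; its weights stay fixed."""

    def __init__(self, dim, hidden, rng):
        self.w1, self.b1 = init_matrix(hidden, dim, rng), [0.0] * hidden
        self.w2, self.b2 = init_matrix(dim, hidden, rng), [0.0] * dim

    def __call__(self, x):
        return linear(relu(linear(x, self.w1, self.b1)), self.w2, self.b2)


class BottleneckAdapter:
    """Hypothetical adapter: down-projection, non-linearity, up-projection."""

    def __init__(self, dim, bottleneck, rng, scale=1.0):
        self.down_w, self.down_b = init_matrix(bottleneck, dim, rng), [0.0] * bottleneck
        # Assumption: zero-initialized up-projection, so the adapter is a no-op at start.
        self.up_w = [[0.0] * bottleneck for _ in range(dim)]
        self.up_b = [0.0] * dim
        self.scale = scale

    def __call__(self, x):
        hidden = relu(linear(x, self.down_w, self.down_b))
        return [self.scale * v for v in linear(hidden, self.up_w, self.up_b)]


def parallel_mlp_with_adapter(x, mlp, adapter):
    """Sum the frozen MLP branch and the trainable adapter branch on the same input."""
    return [m + a for m, a in zip(mlp(x), adapter(x))]


rng = random.Random(0)
mlp = FrozenMLP(dim=8, hidden=32, rng=rng)
adapter = BottleneckAdapter(dim=8, bottleneck=2, rng=rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(8)]
out = parallel_mlp_with_adapter(x, mlp, adapter)
```

In an actual training setup only the adapter's parameters would receive gradient updates; the sketch shows the forward pass only.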
Abstract
Recently, Transformers have been introduced into the field of acoustics recognition. They are pre-trained on large-scale datasets using methods such as supervised learning and semi-supervised learning, and they demonstrate robust generality: they fine-tune easily to downstream tasks and perform more robustly. However, the predominant fine-tuning method is still full fine-tuning, which updates all parameters during training. This not only incurs significant memory and time costs but also compromises the model's generality. Other fine-tuning methods either struggle to address this issue or fail to match the performance of full fine-tuning. We therefore conduct a comprehensive analysis of existing fine-tuning methods and propose an efficient fine-tuning approach based on Adapter tuning, namely AAT. The core idea is to freeze the audio Transformer model and insert extra learnable Adapters, which efficiently acquire downstream task knowledge without compromising the model's original generality. Extensive experiments show that our method achieves performance comparable or even superior to full fine-tuning while optimizing only 7.118% of the parameters. It also outperforms other fine-tuning methods.
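To give a feel for why freezing the backbone and training only Adapters touches a small share of the parameters, the following sketch counts parameters for a hypothetical configuration. All numbers are assumptions chosen for illustration (embedding width 768, MLP width 3072, 12 blocks, bottleneck width 64, two bottleneck-shaped adapters per block); they are not the paper's settings, the count ignores the patch embedding and classifier head, and the result is not expected to reproduce the reported 7.118%.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def transformer_block_params(dim, mlp_hidden):
    """Parameters of one standard Transformer encoder block."""
    attention = 4 * linear_params(dim, dim)  # query, key, value, output projections
    mlp = linear_params(dim, mlp_hidden) + linear_params(mlp_hidden, dim)
    layer_norms = 2 * 2 * dim  # two LayerNorms, each with a scale and a shift
    return attention + mlp + layer_norms


def bottleneck_adapter_params(dim, bottleneck):
    """Down-projection plus up-projection of a hypothetical bottleneck adapter."""
    return linear_params(dim, bottleneck) + linear_params(bottleneck, dim)


def trainable_fraction(dim, mlp_hidden, depth, bottleneck, adapters_per_block):
    """Share of trainable (adapter) parameters among all block parameters."""
    frozen = depth * transformer_block_params(dim, mlp_hidden)
    trainable = depth * adapters_per_block * bottleneck_adapter_params(dim, bottleneck)
    # Assumption: the fraction is measured against frozen plus adapter parameters.
    return trainable / (frozen + trainable)


# Hypothetical configuration, not the paper's.
fraction = trainable_fraction(
    dim=768, mlp_hidden=3072, depth=12, bottleneck=64, adapters_per_block=2
)
print(f"trainable share: {fraction:.3%}")
```

The share grows roughly linearly with the bottleneck width, so the width is the main knob for trading adaptation capacity against the number of updated parameters.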
