AAT: Adapting Audio Transformer for Various Acoustics Recognition Tasks

Yun Liang, Hai Lin, Shaojian Qiu, Yihang Zhang

TL;DR

This work tackles transferring pre-trained audio Transformers to diverse downstream acoustic tasks without sacrificing generalization or incurring prohibitive fine-tuning costs. The authors propose AAT, which freezes the backbone and introduces two adapters: a bottleneck MLP Adapter in parallel with the MLP and a Spatial Adapter after the multi-head self-attention (MHSA) module. Together these enable task-specific adaptation while preserving the pre-trained representations. Experiments on six datasets across event, speech, and music tasks show that AAT matches or surpasses full fine-tuning while updating only 7.118% of the parameters and training faster per epoch. The study demonstrates the practical value of parameter-efficient fine-tuning (PEFT) in acoustics recognition and highlights the conditions under which SSL vs SL pre-training benefits AAT.
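To make the "bottleneck adapter in parallel with a frozen MLP" idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the ReLU activation, the zero-initialized up-projection, the `scale` factor, and the names `BottleneckAdapter`, `frozen_mlp`, and `block_mlp_branch` are all hypothetical. Layer normalization is left out, and so is the Spatial Adapter, because this summary does not describe its internals.

```python
import random


def linear(x, w, b):
    """Compute x @ w + b for one vector x; w is a list of in_dim rows of out_dim values."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]


def relu(v):
    return [max(0.0, t) for t in v]


class BottleneckAdapter:
    """Down-project, apply a non-linearity, up-project (the only trainable part)."""

    def __init__(self, dim, bottleneck, scale=1.0, seed=0):
        rng = random.Random(seed)
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(bottleneck)] for _ in range(dim)]
        self.b_down = [0.0] * bottleneck
        # Assumption: zero-init the up-projection so the adapter starts as a no-op.
        self.w_up = [[0.0] * dim for _ in range(bottleneck)]
        self.b_up = [0.0] * dim
        self.scale = scale

    def __call__(self, x):
        hidden = relu(linear(x, self.w_down, self.b_down))
        return [self.scale * t for t in linear(hidden, self.w_up, self.b_up)]


def frozen_mlp(x):
    """Stand-in for the pre-trained MLP; its weights would never be updated."""
    return [2.0 * t for t in x]


def block_mlp_branch(x, adapter):
    """Residual connection + frozen MLP output + parallel adapter output."""
    mlp_out = frozen_mlp(x)
    adapter_out = adapter(x)
    return [xi + mi + ai for xi, mi, ai in zip(x, mlp_out, adapter_out)]


adapter = BottleneckAdapter(dim=4, bottleneck=2)
x = [1.0, -2.0, 0.5, 3.0]
out = block_mlp_branch(x, adapter)  # equals x + frozen_mlp(x) at initialization
```

With the zero-initialized up-projection, the adapted block starts out reproducing the frozen block exactly. Training then moves only the adapter weights away from that starting point, which is one common way to preserve pre-trained behavior early in fine-tuning.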

Abstract

Recently, Transformers have been introduced into the field of acoustics recognition. They are pre-trained on large-scale datasets using methods such as supervised learning and semi-supervised learning, and demonstrate robust generality: they fine-tune easily to downstream tasks and perform more robustly. However, the predominant fine-tuning method is still full fine-tuning, which updates all parameters during training. This not only incurs significant memory usage and time costs but also compromises the model's generality. Other fine-tuning methods either struggle to address this issue or fail to match the performance of full fine-tuning. Therefore, we conduct a comprehensive analysis of existing fine-tuning methods and propose an efficient fine-tuning approach based on Adapter tuning, namely AAT. The core idea is to freeze the audio Transformer model and insert extra learnable Adapters, efficiently acquiring downstream task knowledge without compromising the model's original generality. Extensive experiments show that our method achieves performance comparable or even superior to full fine-tuning while optimizing only 7.118% of the parameters. It also outperforms other fine-tuning methods.
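The small trainable fraction follows from simple parameter counting. The sketch below counts weights for a generic Transformer encoder block and for bottleneck adapters attached to it. All dimensions (hidden size 768, depth 12, bottleneck width 64, two adapters per block) are hypothetical values chosen for illustration. The count ignores patch and positional embeddings and the classification head, and it is not an attempt to reproduce the reported 7.118%.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def backbone_block_params(d, mlp_ratio=4):
    """Parameters of one standard encoder block (attention, MLP, two LayerNorms)."""
    attention = linear_params(d, 3 * d) + linear_params(d, d)  # QKV + output projection
    mlp = linear_params(d, mlp_ratio * d) + linear_params(mlp_ratio * d, d)
    layer_norms = 2 * 2 * d  # scale and shift for each of the two norms
    return attention + mlp + layer_norms


def adapter_params(d, r):
    """Parameters of one bottleneck adapter: down-projection d->r, up-projection r->d."""
    return linear_params(d, r) + linear_params(r, d)


def trainable_fraction(d, depth, r, adapters_per_block):
    """Share of trainable (adapter) parameters among all block parameters."""
    frozen = depth * backbone_block_params(d)
    trainable = depth * adapters_per_block * adapter_params(d, r)
    return trainable / (frozen + trainable)


# Hypothetical configuration, not taken from the paper.
fraction = trainable_fraction(d=768, depth=12, r=64, adapters_per_block=2)
print(f"trainable share of block parameters: {fraction:.3%}")
```

Because adapter size grows linearly with the bottleneck width `r`, while the frozen block grows quadratically with the hidden size `d`, the trainable share stays small for modest values of `r`.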

Paper Structure

This paper contains 18 sections, 4 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Performance comparison on the Openmic dataset. Our proposed AAT achieves the highest accuracy with significantly fewer tuned parameters.
  • Figure 2: Brief illustration of full fine-tuning (a) and fine-tuning with Adapters (b) on a pre-trained audio Transformer. (c) The architecture of the proposed Adapter.
  • Figure 3: The process of Prompt tuning. $\mathrm{[CLS]}$ denotes class tokens, $\mathrm{[P]}$ denotes trainable Prompt tokens, and $\mathrm{[E]}$ denotes input spectrogram embeddings.