Adapter Incremental Continual Learning of Efficient Audio Spectrogram Transformers
Nithish Muthuchamy Selvaraj, Xiaobao Guo, Adams Kong, Bingquan Shen, Alex Kot
TL;DR
This paper tackles Task Incremental Continual Learning (TI-CL) for audio classification with Audio Spectrogram Transformers (AST), addressing both parameter inefficiency and computational inefficiency. It introduces Adapter Incremental Continual Learning (AI-CL), which freezes the AST backbone and attaches per-task Convolutional Adapters, creating task-specific sub-networks with few trainable parameters. It also introduces Frequency-Time factorized Attention (FTA), which reduces the cost of self-attention. Experiments on ESC-50, SCv2, and AVE show that AI-CL prevents catastrophic forgetting across sequential tasks while keeping the compute budget low. A comparison of Parameter Efficient Transfer (PET) methods, including ConvPass and AdaptFormer, informs the choice of adapter. Overall, the method makes TI-CL scalable to long-duration audio by combining parameter-efficient adapters with compute-efficient attention, which is of practical value for real-world continual learning systems.
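The frozen-backbone, per-task-adapter idea can be illustrated with a toy sketch. This is not the authors' implementation: `frozen_backbone`, `TaskAdapter`, and `AdapterIncrementalModel` are hypothetical names, and the adapter here is a simple scale-and-shift rather than a Convolutional Adapter. It only shows why adding a new task cannot change the predictions of an earlier one when the shared backbone is never updated.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Vector = List[float]


def frozen_backbone(x: Vector) -> Vector:
    """Stand-in for the frozen pre-trained backbone: a fixed map that is never updated."""
    return [2.0 * v + 1.0 for v in x]


@dataclass
class TaskAdapter:
    """Hypothetical per-task module (toy scale/shift plus a toy classification head)."""
    scale: float
    shift: float
    num_classes: int

    def __call__(self, features: Vector) -> Vector:
        adapted = [self.scale * f + self.shift for f in features]
        total = sum(adapted)
        # Toy head: one logit per class.
        return [total * (c + 1) for c in range(self.num_classes)]


@dataclass
class AdapterIncrementalModel:
    """Keeps one adapter per task; the task id selects the sub-network at inference."""
    adapters: Dict[str, TaskAdapter] = field(default_factory=dict)

    def add_task(self, task_id: str, adapter: TaskAdapter) -> None:
        if task_id in self.adapters:
            raise ValueError(f"task {task_id!r} already has an adapter")
        self.adapters[task_id] = adapter

    def predict(self, task_id: str, x: Vector) -> Vector:
        return self.adapters[task_id](frozen_backbone(x))


model = AdapterIncrementalModel()
model.add_task("task_a", TaskAdapter(scale=1.5, shift=0.1, num_classes=3))
sample = [0.5, -1.0]
before = model.predict("task_a", sample)

# Learning a second task only adds a new adapter; nothing existing is modified.
model.add_task("task_b", TaskAdapter(scale=0.7, shift=-0.2, num_classes=5))
after = model.predict("task_a", sample)
```

In this sketch `before` and `after` are identical, which is the sense in which per-task adapters on a frozen backbone avoid catastrophic forgetting in the task-incremental setting, where the task id is known at inference time.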
Abstract
Continual learning involves training neural networks incrementally for new tasks while retaining the knowledge of previous tasks. However, efficiently fine-tuning the model for sequential tasks with minimal computational resources remains a challenge. In this paper, we propose Task Incremental Continual Learning (TI-CL) of audio classifiers with both parameter-efficient and compute-efficient Audio Spectrogram Transformers (AST). To reduce the trainable parameters without performance degradation for TI-CL, we compare several Parameter Efficient Transfer (PET) methods and propose AST with Convolutional Adapters for TI-CL, which has fewer than 5% of the trainable parameters of its fully fine-tuned counterpart. To reduce the computational complexity, we introduce a novel Frequency-Time factorized Attention (FTA) method that replaces the traditional self-attention in transformers for audio spectrograms. FTA achieves competitive performance with only a fraction of the computations required by Global Self-Attention (GSA). Finally, we formulate our method for TI-CL, called Adapter Incremental Continual Learning (AI-CL), as a combination of the "parameter-efficient" Convolutional Adapter and the "compute-efficient" FTA. Experiments on ESC-50, SpeechCommandsV2 (SCv2), and Audio-Visual Event (AVE) benchmarks show that our proposed method prevents catastrophic forgetting in TI-CL while maintaining a lower computational budget.
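To give a feel for the compute difference between GSA and a factorized attention, the sketch below counts query-key pairs for a grid of spectrogram tokens. It assumes, as an illustration only, that a factorized scheme lets each token attend to tokens in its own time frame (along frequency) and in its own frequency band (along time); the exact FTA formulation is not given in this abstract and may differ. The function names and the 12 x 100 grid are hypothetical.

```python
def global_attention_pairs(num_freq: int, num_time: int) -> int:
    """Query-key pairs when every token attends to every token."""
    num_tokens = num_freq * num_time
    return num_tokens * num_tokens


def factorized_attention_pairs(num_freq: int, num_time: int) -> int:
    """Query-key pairs under the assumed frequency-time factorization."""
    # Attention along frequency: each of the num_time frames holds num_freq tokens.
    along_freq = num_time * num_freq * num_freq
    # Attention along time: each of the num_freq bands holds num_time tokens.
    along_time = num_freq * num_time * num_time
    return along_freq + along_time


# Hypothetical token grid: 12 frequency patches by 100 time patches.
F, T = 12, 100
gsa_pairs = global_attention_pairs(F, T)
fta_pairs = factorized_attention_pairs(F, T)
reduction = gsa_pairs / fta_pairs  # equals F*T / (F + T)
```

Under these assumptions the pair count drops from (F·T)² to F·T·(F + T), so the saving grows with the number of time patches, which is consistent with the motivation of handling long-duration audio.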
