Adapter Incremental Continual Learning of Efficient Audio Spectrogram Transformers

Nithish Muthuchamy Selvaraj, Xiaobao Guo, Adams Kong, Bingquan Shen, Alex Kot

TL;DR

This paper tackles Task Incremental Continual Learning (TI-CL) for audio classification with Audio Spectrogram Transformers (AST), addressing both parameter and computational inefficiency. It introduces Adapter Incremental Continual Learning (AI-CL), which freezes the AST backbone and attaches per-task Convolutional Adapters to create task-specific sub-networks with few trainable parameters, and Frequency-Time factorized Attention (FTA), which reduces the cost of self-attention. Experiments on ESC-50, SCv2, and AVE show that AI-CL prevents catastrophic forgetting across sequential tasks while keeping the compute budget low, and a comparison of PET methods identifies effective adapters such as ConvPass and AdaptFormer. By combining parameter-efficient adapters with compute-efficient attention, the method makes TI-CL scalable to long-duration audio.
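The freeze-the-backbone, one-adapter-per-task idea can be illustrated with a toy sketch. This is not the paper's implementation: scalars stand in for the AST backbone and the Convolutional Adapters, and every name here (`FrozenBackbone`, `Adapter`, `AdapterIncrementalModel`, `fit`) is hypothetical. It only shows why training a new task cannot change predictions for an earlier one when each task owns its parameters.

```python
# Toy illustration only: scalars stand in for the AST backbone and adapters.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FrozenBackbone:
    """Stand-in for the pre-trained AST; frozen=True forbids reassignment."""
    weight: float = 2.0

    def forward(self, x: float) -> float:
        return self.weight * x


@dataclass
class Adapter:
    """Stand-in for one task's adapter (assumed residual form)."""
    scale: float = 0.0  # the only trainable parameter in this toy

    def forward(self, h: float) -> float:
        return h + self.scale * h  # residual update on the backbone feature


@dataclass
class AdapterIncrementalModel:
    backbone: FrozenBackbone
    adapters: dict = field(default_factory=dict)  # task id -> Adapter

    def add_task(self, task_id: str) -> Adapter:
        self.adapters[task_id] = Adapter()
        return self.adapters[task_id]

    def predict(self, task_id: str, x: float) -> float:
        # Task-incremental setting: the task id is given at inference time.
        return self.adapters[task_id].forward(self.backbone.forward(x))


def fit(model, task_id, x, target, lr=0.01, steps=200):
    """Gradient descent on squared error, updating only this task's adapter."""
    adapter = model.adapters[task_id]
    h = model.backbone.forward(x)
    for _ in range(steps):
        err = adapter.forward(h) - target
        adapter.scale -= lr * 2 * err * h  # d(err^2)/d(scale) = 2 * err * h


model = AdapterIncrementalModel(FrozenBackbone())
model.add_task("task_a")
fit(model, "task_a", x=1.0, target=3.0)
before = model.predict("task_a", 1.0)

model.add_task("task_b")  # a later task gets its own adapter
fit(model, "task_b", x=1.0, target=-1.0)
after = model.predict("task_a", 1.0)  # task_a parameters were never touched
```

Because `task_b` training only writes to its own `Adapter`, `before` and `after` are identical. The cost of this design is that stored parameters grow with the number of tasks, which is why the size of each adapter matters.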

Abstract

Continual learning involves training neural networks incrementally for new tasks while retaining the knowledge of previous tasks. However, efficiently fine-tuning the model for sequential tasks with minimal computational resources remains a challenge. In this paper, we propose Task Incremental Continual Learning (TI-CL) of audio classifiers with both parameter-efficient and compute-efficient Audio Spectrogram Transformers (AST). To reduce the trainable parameters without performance degradation for TI-CL, we compare several Parameter Efficient Transfer (PET) methods and propose AST with Convolutional Adapters for TI-CL, which trains fewer than 5% of the parameters of its fully fine-tuned counterparts. To reduce the computational complexity, we introduce a novel Frequency-Time factorized Attention (FTA) method that replaces the traditional self-attention in transformers for audio spectrograms. FTA achieves competitive performance with only a fraction of the computations required by Global Self-Attention (GSA). Finally, we formulate our method for TI-CL, called Adapter Incremental Continual Learning (AI-CL), as a combination of the "parameter-efficient" Convolutional Adapter and the "compute-efficient" FTA. Experiments on ESC-50, SpeechCommandsV2 (SCv2), and Audio-Visual Event (AVE) benchmarks show that our proposed method prevents catastrophic forgetting in TI-CL while maintaining a lower computational budget.
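A rough count shows where the savings from factorized attention could come from. The sketch below assumes, based only on the description of FTA as attention along the frequency and time axes, that each token attends to the tokens sharing its time index and the tokens sharing its frequency index. The exact formulation is the paper's and is not reproduced here; the function names and the 12 x 100 token grid are illustrative assumptions.

```python
def gsa_pairs(n_freq: int, n_time: int) -> int:
    """Query-key pairs when every token attends to every token."""
    n_tokens = n_freq * n_time
    return n_tokens * n_tokens


def fta_pairs(n_freq: int, n_time: int) -> int:
    """Query-key pairs under the assumed factorization: each token attends
    to n_freq tokens along the frequency axis and n_time tokens along the
    time axis (the token itself is counted once per axis)."""
    n_tokens = n_freq * n_time
    return n_tokens * (n_freq + n_time)


# Hypothetical grid: 12 frequency patches by 100 time patches.
n_freq, n_time = 12, 100
gsa = gsa_pairs(n_freq, n_time)
fta = fta_pairs(n_freq, n_time)
ratio = fta / gsa  # simplifies to (n_freq + n_time) / (n_freq * n_time)
print(f"GSA pairs: {gsa}, FTA pairs: {fta}, ratio: {ratio:.4f}")
```

Under this assumption the ratio shrinks as the clip gets longer, since the global count grows quadratically in the number of time patches while the factorized count grows more slowly. This is consistent with the stated motivation of handling long-duration audio.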

Paper Structure (15 sections, 4 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Adapter Incremental Continual Learning of Audio Spectrogram Transformers.
  • Figure 2: Frequency-Time factorized Attention for a (yellow) token along the frequency and time axis.
  • Figure 3: Performance of the AST model in TI-CL setup for three training modes.