Audio Mamba: Pretrained Audio State Space Model For Audio Tagging
Jiaju Lin, Haoxuan Hu
TL;DR
The paper tackles the quadratic self-attention bottleneck in audio transformers by introducing Audio Mamba, a self-attention-free architecture based on state space models (SSMs) that achieves linear-time complexity $O(n)$ in sequence length through a multi-stage SSM backbone and patch embeddings. It combines HTS-AT-inspired spectrogram patching with VMamba-based backbones to enable scalable, efficient audio tagging. Experiments on AudioSet show that Audio Mamba variants achieve competitive performance with substantially fewer parameters than state-of-the-art spectrogram transformers, and ablations highlight the importance of pretrained VMamba backbones over added transformer blocks or data augmentation. The work demonstrates the viability of SSM-based architectures for large-scale audio tagging and suggests future directions in pretraining, self-supervised learning, and fuller exploitation of Mamba architectures for audio.
Abstract
Audio tagging is the important task of mapping audio samples to their corresponding categories. Recent endeavours that exploit transformer models in this field have achieved great success. However, the quadratic cost of self-attention limits the scaling of audio transformer models and further constrains the development of more universal audio models. In this paper, we attempt to solve this problem by proposing Audio Mamba, a self-attention-free approach that captures long-range dependencies in audio spectrograms with state space models. Our experimental results on two audio-tagging datasets demonstrate the parameter efficiency of Audio Mamba: it achieves results comparable to SOTA audio spectrogram transformers with one third of the parameters.
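To illustrate why a state space model avoids the quadratic cost mentioned above, the following minimal sketch runs a linear state space recurrence $h_t = a \odot h_{t-1} + b\,x_t$, $y_t = c^\top h_t$ in a single pass over the sequence. It is not the Audio Mamba implementation: the function names, the diagonal parameterization, and the fixed (input-independent) parameters `a`, `b`, `c` are assumptions made for illustration, whereas Mamba-style models use input-dependent parameters and 2D scans over spectrogram patches.

```python
def ssm_scan(x, a, b, c):
    """Run a diagonal linear state space recurrence over a 1-D sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state dimension)
    y_t = sum(c * h_t)

    All parameters are fixed lists of equal length (hypothetical setup).
    """
    state = [0.0] * len(a)
    outputs = []
    for x_t in x:  # one step per token: cost grows linearly with len(x)
        state = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs


def attention_pair_count(n):
    """Number of query-key scores full self-attention computes for n tokens."""
    return n * n


def scan_step_count(n):
    """Number of recurrence steps ssm_scan takes for n tokens."""
    return n
```

For a sequence of 1,000 patch tokens, `scan_step_count` gives 1,000 recurrence steps while `attention_pair_count` gives 1,000,000 pairwise scores, which is the gap that grows as spectrograms get longer.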
