Audio Mamba: Pretrained Audio State Space Model For Audio Tagging

Jiaju Lin, Haoxuan Hu

TL;DR

The paper tackles the quadratic self-attention bottleneck in audio transformers by introducing AudioMamba, a self-attention-free architecture based on state-space models that achieves linear-time complexity $O(n)$ in sequence length through a multi-stage state-space (SS) backbone and patch embeddings. It combines HTS-AT-inspired spectrogram patching with VMamba-based backbones to enable scalable, efficient audio tagging. Experiments on AudioSet show that AudioMamba variants achieve competitive performance with substantially fewer parameters than state-of-the-art spectrogram transformers, and ablations indicate that a pretrained VMamba backbone matters more than added transformer blocks or data augmentation. The work demonstrates the viability of SS-based architectures for large-scale audio tagging and suggests future directions in pretraining, self-supervised learning, and fuller exploitation of Mamba architectures for audio.

Abstract

Audio tagging is an important task of mapping audio samples to their corresponding categories. Recent endeavours that exploit transformer models in this field have achieved great success. However, the quadratic self-attention cost limits the scaling of audio transformer models and further constrains the development of more universal audio models. In this paper, we attempt to solve this problem by proposing Audio Mamba, a self-attention-free approach that captures long-range dependencies in audio spectrograms with state space models. Our experimental results on two audio-tagging datasets demonstrate the parameter efficiency of Audio Mamba: it achieves results comparable to SOTA audio spectrogram transformers with one third of the parameters.
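
To illustrate why a state space model scales linearly with sequence length, the following is a minimal sketch of a discretized state-space recurrence, not the paper's implementation. It assumes a scalar, time-invariant system with zero-order-hold discretization; the function names and the default values of `a`, `b`, `c`, and `dt` are hypothetical. Mamba-style models additionally make these parameters input-dependent (selective), which this sketch omits.

```python
import math


def discretize(a, b, dt):
    """Zero-order-hold discretization of the scalar system h'(t) = a*h(t) + b*x(t).

    Assumes a != 0. Returns (a_bar, b_bar) for h[k] = a_bar*h[k-1] + b_bar*x[k].
    """
    a_bar = math.exp(a * dt)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar


def ssm_scan(xs, a=-1.0, b=1.0, c=1.0, dt=0.1):
    """Run the recurrence over a sequence in a single pass.

    Each step does a constant amount of work, so the total cost is O(n)
    in the sequence length, in contrast to the O(n^2) pairwise
    comparisons of self-attention.
    """
    a_bar, b_bar = discretize(a, b, dt)
    h = 0.0  # hidden state carried across time steps
    ys = []
    for x in xs:
        h = a_bar * h + b_bar * x  # state update
        ys.append(c * h)           # readout y[k] = c * h[k]
    return ys


if __name__ == "__main__":
    # An impulse at the first step decays exponentially afterwards.
    print(ssm_scan([1.0, 0.0, 0.0, 0.0]))
```

The hidden state `h` is the only memory carried between steps, which is what lets the model summarize long spectrogram sequences without comparing every pair of positions.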

Paper Structure (10 sections, 3 equations, 2 figures, 2 tables)

This paper contains 10 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: The patch embedding extraction process. The input spectrogram is first divided into a grid of non-overlapping patches. A Patch Embedding layer then maps each patch into a compact vector representation, capturing the local spectral features. These patch embeddings serve as the input to the subsequent SS blocks. This figure is a repurposed version of Figure 1 from our previous work htsat-ke2022.
  • Figure 2: (a) Architecture of Audio Mamba, a multi-stage audio state-space model. The input patch embeddings pass through a series of stages, with each stage applying a block-based downsampling operation to progressively capture features at different scales. The output of each downsampling layer is then processed through several neural network layers, including a feed-forward network (FFN), layer normalization (LN) layers, and a 2D selective scan (SS2D) block, before producing the final output. (d) Details of the SS Block, which is a core component of the AudioMamba architecture. This figure is a repurposed version of Figure 8 from vmamba.
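
The patch embedding step described in the Figure 1 caption can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' code: the function names, the tiny 4x8 "spectrogram", the patch size, and the hand-written projection weights are all hypothetical. In practice this step is commonly implemented as a strided 2D convolution whose kernel size equals the patch size, which is equivalent to the linear projection of flattened patches shown here.

```python
def extract_patches(spec, patch):
    """Split a 2D spectrogram (list of rows) into non-overlapping
    patch x patch tiles, each flattened in row-major order."""
    n_rows, n_cols = len(spec), len(spec[0])
    if n_rows % patch or n_cols % patch:
        raise ValueError("spectrogram dimensions must be divisible by patch size")
    patches = []
    for r in range(0, n_rows, patch):
        for c in range(0, n_cols, patch):
            tile = [spec[r + i][c + j] for i in range(patch) for j in range(patch)]
            patches.append(tile)
    return patches


def embed(patches, weight, bias):
    """Linearly project each flattened patch to an embedding vector.

    weight has shape (embed_dim, patch*patch); bias has length embed_dim.
    """
    return [
        [sum(w * v for w, v in zip(row, p)) + b for row, b in zip(weight, bias)]
        for p in patches
    ]


if __name__ == "__main__":
    # Hypothetical 4 x 8 spectrogram whose entries encode their own position.
    spec = [[r * 8 + c for c in range(8)] for r in range(4)]
    patches = extract_patches(spec, patch=2)   # 2 x 4 grid = 8 patches
    weight = [[1, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1]]  # embed_dim = 3
    bias = [0.0, 0.0, 0.5]
    tokens = embed(patches, weight, bias)
    print(len(tokens), tokens[0])
```

Each resulting vector is one token in the sequence that the subsequent SS blocks consume; a real model would learn `weight` and `bias` during training.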