SSAMBA: Self-Supervised Audio Representation Learning with Mamba State Space Model

Siavash Shams, Sukru Samet Dindar, Xilin Jiang, Nima Mesgarani

TL;DR

SSAMBA presents the first self-supervised, attention-free audio representation learning approach based on a bidirectional Mamba state space model. By processing spectrogram patches with a masked spectrogram patch modeling (MSPM) framework and a mixed discriminative-generative objective, it achieves competitive or superior downstream performance relative to SSAST while markedly reducing inference time and memory usage. The method is pretrained on large unlabeled corpora (AudioSet-2M and LibriSpeech) and demonstrates robust results across diverse tasks, including audio event classification, keyword spotting, speaker identification, and emotion recognition. The work highlights the practical impact of subquadratic, content-aware state space modeling for scalable, efficient audio representation learning, with strong potential for edge- and cloud-based deployment.

Abstract

Transformers have revolutionized deep learning across various tasks, including audio representation learning, due to their powerful modeling capabilities. However, they often suffer from quadratic complexity in both GPU memory usage and computational inference time, affecting their efficiency. Recently, state space models (SSMs) like Mamba have emerged as a promising alternative, offering a more efficient approach by avoiding these complexities. Given these advantages, we explore the potential of SSM-based models in audio tasks. In this paper, we introduce Self-Supervised Audio Mamba (SSAMBA), the first self-supervised, attention-free, and SSM-based model for audio representation learning. SSAMBA leverages the bidirectional Mamba to capture complex audio patterns effectively. We incorporate a self-supervised pretraining framework that optimizes both discriminative and generative objectives, enabling the model to learn robust audio representations from large-scale, unlabeled datasets. We evaluated SSAMBA on various tasks such as audio classification, keyword spotting, and speaker identification. Our results demonstrate that SSAMBA outperforms the Self-Supervised Audio Spectrogram Transformer (SSAST) in most tasks. Notably, SSAMBA is approximately 92.7% faster in batch inference speed and 95.4% more memory-efficient than SSAST for the tiny model size with an input token size of 22k. These efficiency gains, combined with superior performance, underscore the effectiveness of SSAMBA's architectural innovation, making it a compelling choice for a wide range of audio processing applications.
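
The combined discriminative and generative objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes an InfoNCE-style discriminative term (pick the true masked patch among all masked patches) and a mean-squared-error generative term (reconstruct the masked patch), which is the usual shape of masked spectrogram patch modeling. The function names, the dot-product scoring, and the weight `lam` are hypothetical choices for illustration and are not taken from the authors' code.

```python
import math


def info_nce(pred, targets, index):
    """Discriminative term: cross-entropy of identifying the true patch
    `index` among all masked patches, scored by dot product with `pred`."""
    scores = [sum(p * t for p, t in zip(pred, tgt)) for tgt in targets]
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[index]


def mse(recon, target):
    """Generative term: mean squared reconstruction error for one patch."""
    return sum((r - t) ** 2 for r, t in zip(recon, target)) / len(target)


def joint_loss(class_preds, recon_preds, masked_patches, lam=10.0):
    """Average both terms over the masked patches and combine them.

    `lam` is an assumed weighting between the two terms, not a value
    stated on this page.
    """
    n = len(masked_patches)
    l_d = sum(info_nce(class_preds[i], masked_patches, i) for i in range(n)) / n
    l_g = sum(mse(recon_preds[i], masked_patches[i]) for i in range(n)) / n
    return l_d + lam * l_g, l_d, l_g


if __name__ == "__main__":
    # Two toy "patches" of dimension 2 stand in for flattened spectrogram patches.
    patches = [[1.0, 0.0], [0.0, 1.0]]
    noisy_recon = [[0.9, 0.1], [0.2, 0.8]]
    total, l_d, l_g = joint_loss(patches, noisy_recon, patches)
    print(f"total={total:.4f} discriminative={l_d:.4f} generative={l_g:.4f}")
```

In a real model the two prediction heads would sit on top of the encoder outputs at the masked positions; here they are passed in directly so the loss arithmetic stays visible.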

Paper Structure

This paper contains 20 sections, 9 equations, 2 figures, 4 tables, and 1 algorithm.

Figures (2)

  • Figure 1: A top-down view of Self-Supervised Audio Mamba
  • Figure 2: (a) Inference Time and (b) GPU Memory Usage for different model types and sizes