State Space Models for Bioacoustics: A Comparative Evaluation with Transformers

Chengyu Tang, Sanjeev Baskiyar

TL;DR

BioMamba is a Mamba-based state space model for bioacoustic tasks, motivated by the data scarcity of the field and the computational demands of Transformer architectures. The model combines a wav2vec-style CNN feature extractor with a stacked Mamba encoder; it is trained with two-phase self-supervised pre-training and then fine-tuned on the BEANS bioacoustics benchmark. Results show that BioMamba achieves accuracy competitive with AVES while reducing VRAM usage by around 40%, which points to better efficiency for long-sequence audio and field deployment. This work presents Mamba as a viable alternative to Transformers in bioacoustics and lays groundwork for scalable, memory-efficient environmental monitoring systems.

Abstract

In this study, we evaluate the efficacy of the Mamba model in the field of bioacoustics. We first pretrain a Mamba-based audio large language model (LLM) on a large corpus of audio data using self-supervised learning. We then fine-tune and evaluate BioMamba on the BEANS benchmark, a collection of diverse bioacoustic tasks including classification and detection, and compare its performance and efficiency with multiple baseline models, including AVES, a state-of-the-art Transformer-based model. The results show that BioMamba achieves performance comparable to AVES while consuming significantly less VRAM, demonstrating its potential in this domain.
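
To make the long-sequence memory argument concrete, the following minimal sketch counts how many frames a wav2vec-style CNN feature extractor produces from raw audio and how large a single self-attention score matrix over those frames would be. It assumes the kernel sizes and strides of the original wav2vec 2.0 encoder and a 16 kHz sample rate; the source only says "wav2vec-style", so BioMamba's actual configuration may differ. The function names are hypothetical, and the sketch illustrates scaling only; it does not reproduce the measured VRAM figures.

```python
# Kernel sizes and strides of the wav2vec 2.0 convolutional feature encoder.
# Assumption: the source only says "wav2vec-style"; BioMamba's values may differ.
KERNELS = (10, 3, 3, 3, 3, 2, 2)
STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # Hz, assumed for illustration


def num_frames(num_samples: int) -> int:
    """Frames left after the stacked, unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(KERNELS, STRIDES):
        length = (length - kernel) // stride + 1
    return length


def attention_score_entries(num_samples: int) -> int:
    """Entries in one T x T attention score matrix (one head, one layer)."""
    frames = num_frames(num_samples)
    return frames * frames


if __name__ == "__main__":
    for seconds in (1, 10, 60):
        samples = seconds * SAMPLE_RATE
        print(seconds, num_frames(samples), attention_score_entries(samples))
```

Under these assumptions, one second of audio yields 49 frames and ten seconds yields 499, so the attention score matrix grows roughly a hundredfold for a tenfold longer clip. A Mamba encoder processes the same frames with a scan whose cost grows linearly in the number of frames, which is the property the efficiency comparison relies on.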

Paper Structure

This paper contains 17 sections, 1 equation, 1 figure, 1 table.

Figures (1)

  • Figure 1: Memory usage comparison between AVES and BioMamba