Few-Shot Bioacoustic Event Detection with Frame-Level Embedding Learning System

PengYuan Zhao, ChengWei Lu, Liang Zou

TL;DR

This work tackles few-shot bioacoustic event detection (FSBED) for ecological monitoring with a frame-level embedding learning system built on the NetMamba Encoder, a state-space model that captures long-range dependencies efficiently. Training is multi-task at the frame level, covering sound event detection and foreground/background classification, and uses log-mel and PCEN features together with data augmentation and targeted post-processing to improve generalization and robustness. The system achieves an F-measure of 56.4%, ranks 2nd in DCASE2024 Task 5, and outperforms the log-mel baseline. Overall, the results suggest that efficient state-space sequence models, combined with careful data handling and post-processing, can deliver accurate and scalable bioacoustic event detection on real-world ecological datasets.

Abstract

This technical report presents our frame-level embedding learning system for few-shot bioacoustic event detection (Task 5) in the DCASE2024 challenge. We extract log-mel and PCEN features from the input audio, use the NetMamba Encoder as the information interaction network, apply data augmentation strategies to improve the generalizability of the trained model, and adopt multiple post-processing methods. Our final system achieved an F-measure of 56.4%, ranking 2nd in the few-shot bioacoustic event detection category of the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge 2024.
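
As a rough illustration of the PCEN (per-channel energy normalization) feature mentioned above, the following is a minimal pure-Python sketch of the standard PCEN recurrence: a per-band smoother M[t] = (1 - s) * M[t-1] + s * E[t], followed by adaptive gain control and root compression. The function name `pcen` and all parameter defaults (`s`, `alpha`, `delta`, `r`, `eps`) are illustrative assumptions, not the settings used in the report, and the mel energies are assumed to have been computed already.

```python
def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a mel spectrogram.

    energy: list of frames, each a list of non-negative mel-band energies.
    Parameter defaults are illustrative assumptions, not the report's values.
    """
    if not energy:
        return []
    # Initialise the smoother M with the first frame (one value per band).
    smoothed = list(energy[0])
    out = []
    for frame in energy:
        # First-order IIR smoother per band: M[t] = (1 - s) * M[t-1] + s * E[t]
        smoothed = [(1.0 - s) * m + s * e for m, e in zip(smoothed, frame)]
        # Adaptive gain control (divide by M^alpha), then root compression
        # with offset delta: (E / (eps + M)^alpha + delta)^r - delta^r
        out.append([
            (e / (eps + m) ** alpha + delta) ** r - delta ** r
            for e, m in zip(frame, smoothed)
        ])
    return out


# Usage: a steady quiet band and a steady loud band end up with similar values,
# which is the loudness-normalizing behaviour that motivates PCEN.
quiet = pcen([[1.0]] * 5)[-1][0]
loud = pcen([[100.0]] * 5)[-1][0]
```

Because the gain control divides each band by its own smoothed energy, stationary background noise is largely flattened while short onsets stand out, which is one plausible reason PCEN is paired with log-mel features for bioacoustic recordings.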

Paper Structure (8 sections, 4 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Training Framework
  • Figure 2: NetMamba Encoder