Deep Active Audio Feature Learning in Resource-Constrained Environments

Md Mohaimenuzzaman, Christoph Bergmeir, Bernd Meyer

TL;DR

This work tackles data scarcity in bioacoustic classification by introducing DAFL, a framework that embeds feature extraction within the active-learning loop and updates the extractor after each annotation round, while operating on raw audio rather than spectrograms. By using BADGE to select informative samples and re-training the feature extractor alongside the classifier, DAFL requires 14.3%, 66.7%, and 47.4% less labelling effort on ESC-50, UrbanSound8K, and InsectWingBeat, respectively, for a large DNN model, with similar savings on a microcontroller-based counterpart. The approach is validated on both large and compact models and retains its performance in a real-world conservation setting, which supports its use for scalable, in-field acoustic monitoring. The work concludes with plans to extend integrated edge-AI systems for autonomous data collection and continual learning, as a path toward deployable, continuously improving bioacoustic classifiers.
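The BADGE selection step mentioned above can be illustrated with a minimal sketch. It assumes the commonly described form of BADGE: each unlabelled sample is embedded as the last-layer gradient of the cross-entropy loss under its own predicted label, and a batch is then chosen by k-means++ seeding over those embeddings. The function names (`gradient_embedding`, `kmeanspp_select`, `badge_select`), the random toy inputs, and the uniformly random first pick are illustrative assumptions, not taken from the paper or its code.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_embedding(probs, hidden):
    """Flattened outer product (p - onehot(argmax p)) x hidden.

    This is the last-layer cross-entropy gradient when the model's own
    prediction is used as a pseudo-label.
    """
    y_hat = max(range(len(probs)), key=probs.__getitem__)
    coeff = [p - (1.0 if k == y_hat else 0.0) for k, p in enumerate(probs)]
    return [c * h for c in coeff for h in hidden]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_select(embs, k, rng):
    """Pick k distinct indices via k-means++ seeding (D^2 sampling)."""
    n = len(embs)
    if k > n:
        raise ValueError("k must not exceed the number of candidates")
    chosen = [rng.randrange(n)]  # assumption: first pick is uniform
    d2 = [sq_dist(e, embs[chosen[0]]) for e in embs]
    while len(chosen) < k:
        if sum(d2) > 0:
            nxt = rng.choices(range(n), weights=d2)[0]
        else:  # every remaining point coincides with a chosen one
            nxt = rng.choice([i for i in range(n) if i not in chosen])
        if nxt in chosen:
            continue
        chosen.append(nxt)
        d2 = [min(d, sq_dist(e, embs[nxt])) for d, e in zip(d2, embs)]
    return chosen


def badge_select(logits_list, hidden_list, k, rng):
    """Return indices of k samples to send for annotation."""
    embs = [gradient_embedding(softmax(lg), h)
            for lg, h in zip(logits_list, hidden_list)]
    return kmeanspp_select(embs, k, rng)


# Toy usage: 30 unlabelled samples, 3 classes, 4-dimensional penultimate features.
rng = random.Random(0)
toy_logits = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(30)]
toy_hidden = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(30)]
picked = badge_select(toy_logits, toy_hidden, k=5, rng=rng)
print(picked)
```

Because the embedding norm grows with the model's uncertainty and the D^2 sampling favours points far from those already chosen, the selected batch tends to be both uncertain and diverse.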

Abstract

The scarcity of labelled data makes training Deep Neural Network (DNN) models in bioacoustic applications challenging. In typical bioacoustics applications, manually labelling the required amount of data can be prohibitively expensive. To effectively identify both new and current classes, DNN models must continue to learn new features from a modest amount of fresh data. Active Learning (AL) is an approach that can help with this learning while requiring little labelling effort. Nevertheless, the use of fixed feature extraction approaches limits feature quality, so the benefits of AL are underutilized. We describe an AL framework that addresses this issue by incorporating feature extraction into the AL loop and refining the feature extractor after each round of manual annotation. In addition, we use raw audio processing rather than spectrograms, which is a novel approach. Experiments reveal that the proposed AL framework requires 14.3%, 66.7%, and 47.4% less labelling effort on the benchmark audio datasets ESC-50, UrbanSound8K, and InsectWingBeat, respectively, for a large DNN model, with similar savings on a microcontroller-based counterpart. Furthermore, we showcase the practical relevance of our study by incorporating data from conservation biology projects. All code is publicly available on GitHub.
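The central idea of the abstract, refining the feature extractor inside the AL loop instead of keeping it fixed, can be sketched structurally as follows. This is a toy illustration under stated assumptions, not the authors' implementation: a per-dimension standardiser stands in for the DNN feature extractor, a nearest-centroid rule stands in for the classifier, and smallest-margin uncertainty sampling stands in for the selection strategy. All function names and the synthetic two-cluster data are hypothetical.

```python
import random
from statistics import mean, pstdev


def fit_extractor(xs):
    """Stand-in feature extractor: standardisation fit on the labelled set."""
    dims = list(zip(*xs))
    mus = [mean(d) for d in dims]
    sds = [pstdev(d) or 1.0 for d in dims]  # guard against zero spread
    return lambda x: [(v - m) / s for v, m, s in zip(x, mus, sds)]


def fit_classifier(feats, ys):
    """Stand-in classifier: one centroid per class in feature space."""
    cents = {}
    for c in set(ys):
        rows = [f for f, y in zip(feats, ys) if y == c]
        cents[c] = [mean(col) for col in zip(*rows)]
    return cents


def distances(cents, f):
    return {c: sum((a - b) ** 2 for a, b in zip(f, cen))
            for c, cen in cents.items()}


def predict(cents, f):
    d = distances(cents, f)
    return min(d, key=d.get)


def margin(cents, f):
    """Gap between the two nearest centroids; small means uncertain."""
    d = sorted(distances(cents, f).values())
    return d[1] - d[0] if len(d) > 1 else 0.0


def active_learning(pool_x, oracle, init_idx, rounds, batch):
    labelled = list(init_idx)
    seen = set(labelled)
    unlabelled = [i for i in range(len(pool_x)) if i not in seen]
    labels = {i: oracle(i) for i in labelled}

    def refit():
        # Key step: extractor AND classifier are both re-fit on all labels so far.
        xs = [pool_x[i] for i in labelled]
        extract = fit_extractor(xs)
        cents = fit_classifier([extract(x) for x in xs],
                               [labels[i] for i in labelled])
        return extract, cents

    for _ in range(rounds):
        extract, cents = refit()
        # Query the samples the current model is least sure about.
        unlabelled.sort(key=lambda i: margin(cents, extract(pool_x[i])))
        query, unlabelled = unlabelled[:batch], unlabelled[batch:]
        for i in query:
            labels[i] = oracle(i)  # simulated manual annotation
        labelled.extend(query)
    return (*refit(), labelled)


# Toy usage: two synthetic 2-D clusters, one seed label per class.
random.seed(0)
pool = ([[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]
        + [[random.gauss(5, 1), random.gauss(5, 1)] for _ in range(50)])
truth = [0] * 50 + [1] * 50
extract, cents, labelled = active_learning(
    pool, truth.__getitem__, init_idx=[0, 50], rounds=3, batch=5)
print(len(labelled))  # 2 seed labels + 3 rounds x 5 queries = 17
```

In a conventional AL setup, `fit_extractor` would be called once before the loop and only the classifier would be re-fit; moving it inside `refit` is the structural change the abstract describes.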

Paper Structure (22 sections, 1 equation, 15 figures, 13 tables, 1 algorithm)

Figures (15)

  • Figure 1: The detailed architecture of the proposed DAFL framework, where the red arrow indicates how the feature extractor is incorporated in the AL loop. Conventional systems use the black path instead of the red path for the active learning loop.
  • Figure 2: ACDNet architecture for an input length of 30,225, with $k$ and $f$ representing kernel size and number of filters, respectively, and $n \in \{1, \ldots, 4\}$. The height ($h$) and width ($w$) represent frequency and time resolution, respectively.
  • Figure 3: Fine-tuning ACDNet in different settings (no-freeze, fixed-freeze and scheduled-freeze) for incremental learning. See Table \ref{tab:al_acdnet_freeze_nofreeze_esc50} for the data used to generate this plot.
  • Figure 4: DICL vs. DAL vs. DAFL on ESC-50, where alknc, allgr and alridgec are DAL methods.
  • Figure 5: CD diagram showing the statistical significance of the learning methods on the ESC-50 dataset.
  • ...and 10 more figures