BioME: A Resource-Efficient Bioacoustic Foundational Model for IoT Applications

Heitor R. Guimarães, Abhishek Tiwari, Mahsa Abdollahi, Anderson R. Avila, Tiago H. Falk

TL;DR

BioME addresses the challenge of deploying bioacoustic encoders on resource-constrained IoT devices. It distills a high-capacity BEATs teacher into a compact Transformer that uses grouped-query attention (GQA) and rotary position embeddings (RoPE). The student is augmented with modulation-spectrum features injected via feature-wise linear modulation (FiLM) and is trained on multi-domain data. The approach yields state-of-the-art or competitive results on the BEANS and acoustic beehive monitoring benchmarks, while a 75% reduction in parameter count and lower memory requirements make edge deployment feasible. The findings suggest that DSP-inspired inductive biases and layer-wise distillation can produce highly discriminative representations for diverse ecological tasks, supporting scalable, in-the-wild passive acoustic monitoring (PAM).

Abstract

Passive acoustic monitoring has become a key strategy in biodiversity assessment, conservation, and behavioral ecology, especially as Internet-of-Things (IoT) devices enable continuous in situ audio collection at scale. While recent self-supervised learning (SSL)-based audio encoders, such as BEATs and AVES, have shown strong performance in bioacoustic tasks, their computational cost and limited robustness to unseen environments hinder deployment on resource-constrained platforms. In this work, we introduce BioME, a resource-efficient audio encoder designed for bioacoustic applications. BioME is trained via layer-to-layer distillation from a high-capacity teacher model, enabling strong representational transfer while reducing the parameter count by 75%. To further improve ecological generalization, the model is pretrained on multi-domain data spanning speech, environmental sounds, and animal vocalizations. A key contribution is the integration of modulation-aware acoustic features via FiLM conditioning, injecting a DSP-inspired inductive bias that enhances feature disentanglement in low-capacity regimes. Across multiple bioacoustic tasks, BioME matches or surpasses the performance of larger models, including its teacher, while being suitable for resource-constrained IoT deployments. For reproducibility, code and pretrained checkpoints are publicly available.
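The FiLM conditioning mentioned in the abstract can be illustrated with a minimal NumPy sketch. It shows only the generic feature-wise linear modulation operation, in which a side-channel context vector is projected to a per-channel scale and shift that are applied to the hidden states. The function name `film_condition`, the tensor shapes, and the use of plain linear projections are assumptions made for illustration; the abstract does not specify how BioME's conditioner is implemented.

```python
import numpy as np


def film_condition(hidden, context, w_gamma, b_gamma, w_beta, b_beta):
    """Apply generic FiLM: hidden * gamma(context) + beta(context).

    hidden:  (batch, seq_len, d_model) patch/token representations
    context: (batch, d_ctx) side-channel features (e.g. modulation features)
    w_*:     (d_ctx, d_model) projection weights (hypothetical parameters)
    b_*:     (d_model,) projection biases (hypothetical parameters)
    """
    gamma = context @ w_gamma + b_gamma  # per-channel scale, (batch, d_model)
    beta = context @ w_beta + b_beta     # per-channel shift, (batch, d_model)
    # Broadcast the same scale and shift over every position in the sequence.
    return gamma[:, None, :] * hidden + beta[:, None, :]


# Toy usage with random values; all dimensions are illustrative only.
rng = np.random.default_rng(0)
batch, seq_len, d_model, d_ctx = 2, 10, 8, 512
hidden = rng.standard_normal((batch, seq_len, d_model))
context = rng.standard_normal((batch, d_ctx))
w_gamma = rng.standard_normal((d_ctx, d_model)) * 0.01
w_beta = rng.standard_normal((d_ctx, d_model)) * 0.01
# Initialising the scale bias to 1 and the shift bias to 0 starts near identity.
out = film_condition(hidden, context, w_gamma, np.ones(d_model),
                     w_beta, np.zeros(d_model))
print(out.shape)  # (2, 10, 8)
```

With zero projection weights, a scale bias of one, and a shift bias of zero, the operation reduces to the identity, which is a common way to let conditioning be learned gradually.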

Paper Structure

This paper contains 25 sections, 1 equation, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Spectrogram (top row) and the modulation spectrum (bottom row) plots, averaged across samples, for three hummingbird species: Anna’s hummingbird (annhum), Broad-tailed hummingbird (brthum), and Costa’s hummingbird (coshum).
  • Figure 2: Block diagram of a single Transformer layer in the proposed BioME encoder. Patch embeddings are processed alongside side-channel context features, which are integrated at each layer through the Conditioner module implementing FiLM-based conditioning.
  • Figure 3: Modulation Spectrogram Average Bands (MSAB) computation. For example, a $256 \times 256$ modulation spectrogram produces a $512$-dimensional MSAB feature vector.
  • Figure 4: Ablation studies for the proposed BioME. We analyze (a) the student architecture selection, (b) the effect of spectral resolution (NFFT size) on the MSAB features, and (c) model scalability (Edge vs. Small vs. Base). Results are averaged across the classification, detection, and auxiliary tasks of the BEANS benchmark. The 'Avg.' column is the average across all bioacoustic tasks (excluding auxiliary tasks).
  • Figure 5: Plot showing the relationship between parameter efficiency, scaling properties, and training data partition.
  • ...and 2 more figures
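
The Figure 3 caption states that a $256 \times 256$ modulation spectrogram yields a $512$-dimensional MSAB vector. One reading consistent with those dimensions is that the map is averaged along each axis and the two resulting 256-dimensional vectors are concatenated. The sketch below follows that reading. The averaging scheme, the function names `modulation_spectrogram` and `msab`, and the FFT, hop, and window settings are all assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def modulation_spectrogram(signal, n_fft=510, hop=128, n_mod_fft=510):
    """Return an (acoustic freq x modulation freq) magnitude map.

    Illustrative parameters: n_fft=510 gives 510 // 2 + 1 = 256 acoustic bins,
    and n_mod_fft=510 gives 256 modulation bins.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # First transform: magnitude spectrogram, shape (256, n_frames).
    spec = np.abs(np.fft.rfft(frames, axis=1)).T
    # Second transform along time for each acoustic bin; the time axis is
    # zero-padded or cropped to n_mod_fft, giving shape (256, 256).
    return np.abs(np.fft.rfft(spec, n=n_mod_fft, axis=1))


def msab(mod_spec):
    """Assumed MSAB: concatenate the row-wise and column-wise averages."""
    per_acoustic_band = mod_spec.mean(axis=1)    # average over modulation freq
    per_modulation_band = mod_spec.mean(axis=0)  # average over acoustic freq
    return np.concatenate([per_acoustic_band, per_modulation_band])


# Toy usage: one second of noise at an assumed 16 kHz sampling rate.
rng = np.random.default_rng(0)
audio = rng.standard_normal(16000)
mod_spec = modulation_spectrogram(audio)
features = msab(mod_spec)
print(mod_spec.shape, features.shape)  # (256, 256) (512,)
```

Under this assumed scheme the feature size grows linearly with the number of bins per axis rather than quadratically, which fits the caption's example of a 512-dimensional vector from a $256 \times 256$ input.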