Foundation Models for Bioacoustics -- a Comparative Review

Raphael Schwinger, Paria Vali Zadeh, Lukas Rauch, Mats Kurz, Tom Hauschild, Sam Lapp, Sven Tomforde

Abstract

Automated bioacoustic analysis is essential for biodiversity monitoring and conservation, requiring advanced deep learning models that can adapt to diverse bioacoustic tasks. This article presents a comprehensive review of large-scale pretrained bioacoustic foundation models and systematically investigates their transferability across multiple bioacoustic classification tasks. We give an overview of bioacoustic representation learning by analysing pretraining data sources and benchmarks. On this basis, we review bioacoustic foundation models, dissecting each model's training data, preprocessing, augmentations, architecture, and training paradigm. Additionally, we conduct an extensive empirical study of selected models on the BEANS and BirdSet benchmarks, evaluating generalisability under linear and attentive probing. Our experimental analysis reveals four findings: Perch 2.0, which builds on diverse multi-taxa supervised pretraining, achieves the highest BirdSet score (restricted evaluation) and the strongest linear probing result on BEANS; BirdMAE is the best model among probing-based strategies on BirdSet and second on BEANS after BEATs$_{NLM}$, the encoder of NatureLM-audio; attentive probing helps extract the full performance of transformer-based models; and general-purpose audio models trained with self-supervised learning on AudioSet outperform many specialised bird sound models on BEANS when evaluated with attentive probing. These findings provide guidance for practitioners who need to select a model and adapt it to new bioacoustic classification tasks via probing.

Paper Structure

This paper contains 51 sections, 1 equation, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Taxonomy distribution (logarithmic scale) of the large bioacoustic data platforms Xeno-Canto (XC), Macaulay Library (MAC), iNaturalist (INA), and Animal Sound Archive (ASA) across five widely studied biological groups: Birds, Amphibians, Mammals, Insects, and Reptiles [noauthor_animal_nodate, noauthor_inaturalist_nodate, noauthor_macaulay_nodate, vellinga_xeno-canto_2015].
  • Figure 2: A sample from the PER dataset of BirdSet [rauch2025birdset] and one from the Dogs dataset of BEANS [hagiwara_beans_2023], preprocessed according to the pipelines of ConvNext$_{BS}$ [rauch2025birdset] and BEATs [chen_beats_2022]. Each sample is displayed as a mel-spectrogram with time on the x-axis and frequency on the y-axis.
  • Figure 3: Comparison of the number of network parameters used for probing the BEATs model on the HSN dataset (21 classes): encoder (90.3M), trainable parameters (1.22M), attentive pooling (1.2M), linear classifier (16.1k).
  • Figure 4: Overview of reported results of general audio models trained on AudioSet. The metric is $mAP$ for AS-20k and $Acc$ for ESC.
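
The parameter counts in the Figure 3 caption can be sanity-checked with a little arithmetic. The sketch below assumes a 768-dimensional BEATs embedding (the caption does not state this width) and a linear classifier made of a single fully connected layer with bias. The helper name `linear_probe_params` is hypothetical, and the attentive pooling size is the rounded value from the caption, not a derived one.

```python
def linear_probe_params(embed_dim: int, num_classes: int) -> int:
    """Parameter count of one fully connected layer: weights plus biases."""
    return embed_dim * num_classes + num_classes


EMBED_DIM = 768    # assumption: BEATs embedding width, not given in the caption
NUM_CLASSES = 21   # HSN class count, from the Figure 3 caption

linear = linear_probe_params(EMBED_DIM, NUM_CLASSES)

# Rounded attentive pooling size as reported in the caption (1.2M).
attentive_pooling = 1_200_000

# Attentive probing trains the pooling module and the linear classifier;
# the encoder itself stays frozen.
trainable = attentive_pooling + linear

print(f"linear classifier: {linear} parameters")
print(f"trainable total:   {trainable / 1e6:.2f}M parameters")
```

Under these assumptions the linear classifier has 16,149 parameters, which matches the 16.1k in the caption, and the trainable total rounds to 1.22M. Either way, the probing head is a small fraction of the 90.3M-parameter encoder.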