Clustering and novel class recognition: evaluating bioacoustic deep learning feature extractors

Vincent S. Kather, Burooj Ghani, Dan Stowell

TL;DR

This work addresses the limits of classifier-only benchmarks by analyzing the embeddings that bioacoustic feature extractors produce. The authors isolate 15 pretrained feature extractors, covering supervised and self-supervised training, and evaluate them on two challenging passive acoustic monitoring (PAM) datasets (bird and frog vocalizations) using clustering and kNN classification. Two patterns emerge: supervised extractors trained on bird data achieve the strongest clustering, and often the strongest classification, on in-domain bird data, while the self-supervised AVES models cluster and classify the cross-domain frog data best. The authors also show that applying UMAP to the embeddings improves clustering, and they provide a reproducible embedding-space evaluation workflow (bacpipe) for comparing models beyond their classifiers. Overall, the work highlights the importance of training-domain alignment and embedding-space analysis for robust deployment of bioacoustic models in diverse, noisy, polyphonic environments.

Abstract

In computational bioacoustics, deep learning models are composed of feature extractors and classifiers. The feature extractors generate vector representations of the input sound segments, called embeddings, which can be input to a classifier. While benchmarking of classification scores provides insights into specific performance statistics, it is limited to species that are included in the models' training data, and it cannot be used to compare models trained on very different taxonomic groups. This paper aims to address this gap by analyzing the embeddings generated by the feature extractors of 15 bioacoustic models spanning a wide range of setups (model architectures, training data, training paradigms). We evaluate and compare the ways in which models structure their embedding spaces through clustering and kNN classification, which allows us to focus our comparison on feature extractors independent of their classifiers. We believe that this approach lets us evaluate the adaptability and generalization potential of models beyond the classes they were trained on.
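To make the kNN part of this evaluation concrete, the following is a minimal sketch of classifying embeddings by their nearest neighbours and scoring the result with macro accuracy. It is not the authors' bacpipe code: the function names `knn_predict` and `macro_accuracy`, the Euclidean distance, `k=3`, the toy 2-D vectors, and the class names are all assumptions made for illustration. Macro accuracy is interpreted here as the mean of per-class recall, which may differ from the exact definition used in the paper.

```python
import math
from collections import Counter, defaultdict


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k nearest training embeddings."""
    # Rank training embeddings by Euclidean distance to the query (assumed metric).
    nearest = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def macro_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes count as much as common ones."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == guess)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy 2-D "embeddings" for two hypothetical classes (made-up values).
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["bird_a"] * 3 + ["frog_b"] * 3
test_x = [(0.05, 0.05), (0.95, 1.0), (0.15, 0.2)]
test_y = ["bird_a", "frog_b", "bird_a"]

pred = [knn_predict(train_x, train_y, q, k=3) for q in test_x]
score = macro_accuracy(test_y, pred)
print(pred, score)
```

Because the classifier here is a plain neighbour vote with no trained weights, the score depends only on how the feature extractor arranges the embedding space, which is the property the abstract sets out to compare.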

Paper Structure

This paper contains 9 sections, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: Comparison of feature extractor performance by learning paradigm, training data and application data. The top plot shows clustering results measured by adjusted mutual information (AMI). The bottom plot shows macro accuracy results of kNN classification. Colors correspond to performance of the feature extractors when applied to bird data (blue) and frog data (green). The x-axis shows abbreviated model names, corresponding to the "abbrev." column of the table of models (label tab:bacpipe_models). Models are grouped into categories along the x-axis by training paradigm (supervised learning (supl) and self-supervised learning (ssl)) and training data (bird data and non-bird data, see the same table).
  • Figure 2: Two-dimensional embedding spaces of all feature extractors applied to bird data, sorted from top left to bottom right in descending order of clustering performance (AMI value, indicated next to each model name). Colors correspond to the class labels, which are 11 different tropical bird species.
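For readers unfamiliar with the AMI score reported in both figures, here is a minimal pure-Python sketch of adjusted mutual information between ground-truth labels and cluster assignments. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the normalization uses the arithmetic mean of the two entropies (one common convention; the source does not say which one was used), and degenerate inputs with a single cluster on both sides are scored as 1.0 by convention.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labelling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_info(u, v):
    """Mutual information between two labellings of the same items."""
    n = len(u)
    a, b = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum(
        (nij / n) * math.log(n * nij / (a[i] * b[j]))
        for (i, j), nij in joint.items()
    )


def expected_mutual_info(u, v):
    """Expected mutual information under random labellings with fixed cluster sizes."""
    n = len(u)
    a, b = list(Counter(u).values()), list(Counter(v).values())

    def log_fact(x):
        return math.lgamma(x + 1)  # log(x!)

    emi = 0.0
    for ai in a:
        for bj in b:
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # Log-probability of observing an overlap of size nij (hypergeometric).
                log_p = (
                    log_fact(ai) + log_fact(bj) + log_fact(n - ai) + log_fact(n - bj)
                    - log_fact(n) - log_fact(nij) - log_fact(ai - nij)
                    - log_fact(bj - nij) - log_fact(n - ai - bj + nij)
                )
                emi += (nij / n) * math.log(n * nij / (ai * bj)) * math.exp(log_p)
    return emi


def adjusted_mutual_info(u, v):
    """AMI with arithmetic-mean normalization (an assumed convention)."""
    emi = expected_mutual_info(u, v)
    denom = 0.5 * (entropy(u) + entropy(v)) - emi
    if abs(denom) < 1e-15:
        return 1.0  # both labellings are trivial; treated as perfect agreement
    return (mutual_info(u, v) - emi) / denom


# Hypothetical species labels versus cluster ids from any clustering algorithm.
species = ["a", "a", "b", "b"]
clusters = [1, 1, 0, 0]
print(adjusted_mutual_info(species, clusters))
```

The adjustment subtracts the agreement expected by chance, so a clustering that matches the labels no better than a random partition with the same cluster sizes scores around zero, and cluster ids do not need to match the label names.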