Uncertainty Calibration of Multi-Label Bird Sound Classifiers

Raphael Schwinger, Ben McEwen, Vincent S. Kather, René Heinrich, Lukas Rauch, Sven Tomforde

TL;DR

This work tackles uncertainty calibration for multi-label bird sound classifiers used in passive acoustic monitoring. It benchmarks four state-of-the-art models on BirdSet at the global, per-dataset, and per-class level using threshold-free calibration metrics (ECE, MCS) alongside cmAP, and investigates post hoc methods such as temperature and Platt scaling. Calibration varies substantially across datasets and classes: Perch v2 and ConvNeXt$_{BS}$ are generally underconfident, while AudioProtoPNet and BirdMAE are often overconfident, and calibration can even be better for rarer classes. The study shows that lightweight, deployment-specific calibration on a small labelled calibration set can substantially improve reliability, underscoring the need to evaluate and tailor calibration for each bioacoustic deployment context.

Abstract

Passive acoustic monitoring enables large-scale biodiversity assessment, but reliable classification of bioacoustic sounds requires not only high accuracy but also well-calibrated uncertainty estimates to ground decision-making. In bioacoustics, calibration is challenged by overlapping vocalisations, long-tailed species distributions, and distribution shifts between training and deployment data. The calibration of multi-label deep learning classifiers within the domain of bioacoustics has not yet been assessed. We systematically benchmark the calibration of four state-of-the-art multi-label bird sound classifiers on the BirdSet benchmark, evaluating global, per-dataset, and per-class calibration using threshold-free calibration metrics (ECE, MCS) alongside discrimination metrics (cmAP). Model calibration varies significantly across datasets and classes. While Perch v2 and ConvNeXt$_{BS}$ show better global calibration, results vary between datasets. Both models are consistently underconfident, while AudioProtoPNet and BirdMAE are mostly overconfident. Surprisingly, calibration seems to be better for less frequent classes. Using simple post hoc calibration methods, we demonstrate a straightforward way to improve calibration. A small labelled calibration set is sufficient to significantly improve calibration with Platt scaling, while global calibration parameters suffer from dataset variability. Our findings highlight the importance of evaluating and improving uncertainty calibration in bioacoustic classifiers.
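
To make the calibration metric mentioned in the abstract more concrete, the following is a minimal sketch of the standard equal-width binned expected calibration error (ECE), applied to per-label probabilities flattened across all classes. This is an illustrative assumption: the function name `binned_ece`, the choice of 10 bins, and the flattening step are not taken from the paper, whose exact ECE variant may differ. The MCS metric is not reproduced here.

```python
import numpy as np


def binned_ece(probs, labels, n_bins=10):
    """Equal-width binned ECE over flattened per-label probabilities.

    probs, labels: arrays of the same shape (e.g. samples x classes),
    with probabilities in [0, 1] and binary labels in {0, 1}.
    Hypothetical helper, not the paper's implementation.
    """
    p = np.asarray(probs, dtype=float).ravel()
    y = np.asarray(labels, dtype=float).ravel()
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Inner edges map each probability to a bin index in 0..n_bins-1.
    idx = np.digitize(p, edges[1:-1])
    ece = 0.0
    for k in range(n_bins):
        mask = idx == k
        if mask.any():
            # Gap between mean predicted probability and empirical frequency,
            # weighted by the fraction of predictions falling in this bin.
            gap = abs(p[mask].mean() - y[mask].mean())
            ece += mask.mean() * gap
    return ece
```

A value of 0 means that, within every bin, the mean predicted probability matches the observed frequency of positive labels. The binned ECE is unsigned, so on its own it does not distinguish overconfidence from underconfidence.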

Paper Structure

This paper contains 47 sections, 7 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Calibration evaluation of AudioProtoPNet, BirdMAE, ConvNeXt$_{BS}$ and Perch v2 globally on all BirdSet test samples, visualised in a reliability diagram (Section \ref{sec:uncertainty_and_calibration}). Mean predicted probabilities are shown on the x-axis, while empirical frequencies are shown on the y-axis. The miscalibration score (MCS) is given for each model.
  • Figure 2: Effects of Platt scaling on calibrated confidence scores $\hat{p}_c^{PS}$ as a function of logits $z_c$: (a) varying temperature parameter $T$ with fixed bias $b=0$, (b) varying bias $b$ with fixed temperature parameter $T=1$, and (c) joint variation of $T$ and $b$. The uncalibrated baseline $(T=1,\,b=0)$ is shown in black (solid) in all panels.
  • Figure 3: Calibration performance of ConvNeXt$_{BS}$ on the individual BirdSet evaluation datasets and on all datasets combined.
  • Figure 4: Class-wise ECE scores plotted against the number of samples per class, with models distinguished by colour.
  • Figure 5: Reliability diagrams investigating calibration for common $\uparrow$ and rare $\downarrow$ subsets for different models. MCS values for all models and subsets are noted.
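
The Platt scaling behaviour described in the Figure 2 caption can be sketched as follows. The functional form $\hat{p}_c^{PS} = \sigma(z_c / T + b)$ is an assumption inferred from the caption (temperature $T$, bias $b$, uncalibrated baseline at $T=1,\,b=0$) and may not match the paper's exact parameterisation. The helper names `platt_scale` and `fit_platt`, the gradient-descent fitting procedure, and its hyperparameters are likewise hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def platt_scale(logits, T=1.0, b=0.0):
    """Assumed form: sigmoid(z / T + b). T=1, b=0 leaves scores unchanged."""
    return sigmoid(np.asarray(logits, dtype=float) / T + b)


def bce(probs, labels, eps=1e-12):
    """Mean binary cross-entropy, used as the fitting objective."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))


def fit_platt(logits, labels, lr=0.1, steps=2000):
    """Fit slope a = 1/T and bias b on a small labelled calibration set
    by plain gradient descent on binary cross-entropy (illustrative only)."""
    z = np.asarray(logits, dtype=float).ravel()
    y = np.asarray(labels, dtype=float).ravel()
    a, b = 1.0, 0.0  # start from the uncalibrated baseline
    for _ in range(steps):
        grad = sigmoid(a * z + b) - y  # dBCE/d(a*z + b) per sample
        a -= lr * np.mean(grad * z)
        b -= lr * np.mean(grad)
    return 1.0 / a, b


# Synthetic calibration set: the true probabilities follow sigmoid(0.5*z - 1),
# so the raw scores sigmoid(z) are miscalibrated.
rng = np.random.default_rng(0)
z_cal = 2.0 * rng.normal(size=5000)
y_cal = (rng.random(5000) < sigmoid(0.5 * z_cal - 1.0)).astype(float)
T_hat, b_hat = fit_platt(z_cal, y_cal)
```

In this synthetic setup the fitted parameters should land near $T=2$ and $b=-1$. In practice the calibration logits and labels would come from a small labelled sample of the deployment data, and parameters could be fitted globally or per dataset.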