
Aggregation Strategies for Efficient Annotation of Bioacoustic Sound Events Using Active Learning

Richard Lindholm, Oscar Marklund, Olof Mogren, John Martinsson

TL;DR

This work tackles the annotation bottleneck in bioacoustic sound event detection (SED) by introducing Top $K$ Entropy, an uncertainty-aggregation method that prioritizes the most uncertain segments within a file and turns them into file-level queries for active learning. Using a pool-based active learning loop with a segment-based SED model built on $1024$-dimensional YAMNet features, the approach comes close to fully supervised performance while using only about $8\%$ of the labels, a reduction in annotation effort of up to $92\%$. The method is robust to noise variability, generalizes across different event ratios, and shows potential for scalable bioacoustic monitoring in diverse ecological settings. These results highlight the practical value of uncertainty-focused querying for time-series annotation tasks beyond bioacoustics.
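The pool-based loop with file-level queries can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `pool_based_active_learning`, its parameters, and the callables it receives (`oracle`, `train`, `predict`, `file_uncertainty`) are hypothetical. The file score can be any aggregation of segment-level predictions, such as Top $K$ Entropy, and the number of rounds and per-round budget are placeholders.

```python
# Hypothetical sketch of a pool-based active learning loop with file-level queries.
# All names and signatures are illustrative assumptions, not the paper's code.

def pool_based_active_learning(pool, oracle, train, predict, file_uncertainty,
                               initial_ids, n_rounds, budget):
    """Run up to `n_rounds` of uncertainty-based querying; return (model, labeled).

    pool:             dict file_id -> segment features, e.g. an (n_segments, 1024) array
    oracle:           file_id -> segment labels (stands in for the human annotator)
    train:            dict file_id -> (features, labels)  ->  model
    predict:          (model, features) -> per-segment event probabilities
    file_uncertainty: per-segment probabilities -> one scalar score per file
    initial_ids:      file ids annotated before the first query round
    budget:           number of files queried per round
    """
    unlabeled = dict(pool)  # copy so the caller's pool is not modified
    labeled = {}
    for fid in initial_ids:
        labeled[fid] = (unlabeled.pop(fid), oracle(fid))
    model = train(labeled)

    for _ in range(n_rounds):
        if not unlabeled:
            break  # pool exhausted
        # Aggregate segment-level predictions into one uncertainty score per file.
        scores = {fid: file_uncertainty(predict(model, x)) for fid, x in unlabeled.items()}
        # Query the `budget` most uncertain files; the annotator labels whole files.
        queried = sorted(scores, key=scores.get, reverse=True)[:budget]
        for fid in queried:
            labeled[fid] = (unlabeled.pop(fid), oracle(fid))
        # Retrain on all files labeled so far.
        model = train(labeled)
    return model, labeled
```

Because whole recordings are queried, each round adds every segment of the selected files to the labeled set; the aggregation function alone decides which recordings those are.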

Abstract

The vast amounts of audio data collected in Sound Event Detection (SED) applications require efficient annotation strategies to enable supervised learning. Manual labeling is expensive and time-consuming, making Active Learning (AL) a promising approach for reducing annotation effort. We introduce Top K Entropy, a novel uncertainty aggregation strategy for AL that prioritizes the most uncertain segments within an audio recording instead of averaging uncertainty across all segments. This approach enables the selection of entire recordings for annotation and improves efficiency when sound events are sparse. We compare Top K Entropy to random sampling and Mean Entropy, and show that it reaches the same model performance with fewer labels, particularly in datasets with sparse sound events. Evaluations use audio mixtures that combine sound recordings from parks with meerkat, dog, and baby-crying sound events, representing real-world bioacoustic monitoring scenarios. Using Top K Entropy for active learning, we achieve performance comparable to training on the fully labeled dataset with only 8% of the labels. Top K Entropy outperforms Mean Entropy, suggesting that the uncertainty of an audio file is best represented by its most uncertain segments. The findings highlight the potential of AL for scalable annotation in audio and time-series applications, including bioacoustics.
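The contrast between Mean Entropy and Top K Entropy can be illustrated with a small sketch. It assumes the model outputs one event probability per segment, so each segment's uncertainty is the binary entropy of that probability; the function names and the handling of files with fewer than $K$ segments are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def segment_entropy(probs, eps=1e-12):
    """Binary entropy (nats) of each segment's predicted event probability."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def mean_entropy(probs):
    """Mean Entropy: the file's uncertainty is the average over all segments."""
    return float(segment_entropy(probs).mean())

def top_k_entropy(probs, k):
    """Top K Entropy: average only the K largest segment entropies in the file."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ent = segment_entropy(probs)
    k = min(k, ent.size)  # assumption: a file with fewer than K segments uses all of them
    return float(np.sort(ent)[-k:].mean())

# Toy files (hypothetical numbers), 100 segments each.
sparse_file = [0.01] * 97 + [0.5] * 3   # three highly uncertain segments, the rest confident
diffuse_file = [0.2] * 100              # every segment mildly uncertain

print("Mean Entropy:", mean_entropy(sparse_file), mean_entropy(diffuse_file))
print("Top 3 Entropy:", top_k_entropy(sparse_file, 3), top_k_entropy(diffuse_file, 3))
```

With these toy files, Mean Entropy scores the sparse file at about 0.075 nats and the diffuse file at about 0.50, because the 97 confident segments dilute the three uncertain ones; Top 3 Entropy scores the sparse file at $\ln 2 \approx 0.693$ and the diffuse file at about 0.50, so the sparse file is queried first. Setting $K=1$ reduces Top $K$ Entropy to the maximum segment entropy, and setting $K$ to the number of segments recovers Mean Entropy.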


Paper Structure

This paper contains 14 sections and 4 figures.

Figures (4)

  • Figure 1: Top $K$ Entropy uncertainty aggregation selects the top $K$ segment entropies (here $K=3$) obtained from the segments of the file. The resulting uncertainty for the file is the average of the selected segment entropies.
  • Figure 2: Total IoU performance of different aggregation strategies and the Random Querying baseline, averaged over 5 seeds. Top K Entropy achieves IoU comparable to a fully supervised model while reducing annotation effort by 92%. Results are based on data generated with event ratio $r=0.2$ and SNR $=0$ dB.
  • Figure 3: Total IoU averaged over 20 seeds (N=20) for Mean Entropy and Top 10 Entropy. The results for the Random Querying baseline are also averaged over 20 seeds, using five different initializations of the dataset. Each line is shown with its 95% confidence interval. Results are based on data generated with event ratio $r=0.2$ and SNR $=0$ dB.
  • Figure 4: Total IoU performance averaged over 5 seeds (N=5) for six fixed values of $K$ in the Top $K$ Entropy querying strategy, along with the baseline strategy. Results are based on data generated with event ratio $r=0.2$ and SNR $=0$ dB.
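The captions report Total IoU but do not define it. One plausible reading for a segment-based model, used in the sketch below purely as an assumption, pools the intersection and union of binary segment labels over all evaluation files before dividing. The function name `total_iou`, the 0.5 decision threshold, and the convention for the empty case are likewise hypothetical.

```python
import numpy as np

# Assumed definition of "Total IoU", for illustration only: intersection and union
# of binary segment-level event masks, summed over all files, then divided once.

def total_iou(y_true_files, y_prob_files, threshold=0.5):
    """Pooled IoU over all files.

    y_true_files: list of per-file sequences of 0/1 segment labels
    y_prob_files: list of per-file sequences of predicted event probabilities
    threshold:    hypothetical cut-off turning probabilities into detections
    """
    intersection = 0
    union = 0
    for y_true, y_prob in zip(y_true_files, y_prob_files):
        t = np.asarray(y_true).astype(bool)
        p = np.asarray(y_prob, dtype=float) >= threshold
        intersection += int(np.logical_and(t, p).sum())
        union += int(np.logical_or(t, p).sum())
    # Convention (assumption): no events in labels or predictions counts as perfect.
    return intersection / union if union > 0 else 1.0
```

Pooling before dividing weights longer files, and files with more event segments, more heavily than averaging a per-file IoU would.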