Semi-supervised classification of bird vocalizations
Simen Hexeberg, Mandar Chitre, Matthias Hoffmann-Kuhnt, Bing Wen Low
TL;DR
This work tackles scalable, long-term monitoring of bird communities through passive acoustics under limited labeled data and dense soundscapes. It introduces a semi-supervised pipeline that combines segmentation, a convolutional auto-encoder, contrastive representation learning, and a supervised classifier, enabling detection of time-overlapping calls when they are separable in frequency. On held-out test data, the method achieves a mean $F_{0.5}$ of $0.701$ across 315 classes from 110 species and outperforms BirdNET on a 103-species test set while using far fewer labeled training samples. It also performs robustly on 144 microphone-hours of continuous Singapore soundscape data. The approach reduces labeling burden, supports efficient clustering and annotation of new classes, and is applicable to broader acoustic tasks involving frequency-modulated signals in complex environments.
Abstract
Changes in bird populations can indicate broader changes in ecosystems, making birds one of the most important animal groups to monitor. Combining machine learning and passive acoustics enables continuous monitoring over extended periods without direct human involvement. However, most existing techniques require extensive expert-labeled datasets for training and cannot easily detect time-overlapping calls in busy soundscapes. We propose a semi-supervised acoustic bird detector designed both to detect time-overlapping calls (when they are separated in frequency) and to train on few labeled samples. The classifier is trained and evaluated on a combination of community-recorded open-source data and long-duration soundscape recordings from Singapore. It achieves a mean $F_{0.5}$ score of $0.701$ across 315 classes from 110 bird species on a held-out test set, with an average of 11 labeled training samples per class. It outperforms the state-of-the-art BirdNET classifier on a test set of 103 bird species despite using significantly fewer labeled training samples. The detector is further tested on 144 microphone-hours of continuous soundscape data. The rich soundscape in Singapore makes suppression of false positives a challenge on raw, continuous data streams. Nevertheless, we demonstrate that achieving high precision in such environments with minimal labeled training data is possible.
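For readers unfamiliar with the metric, the following is a minimal sketch of how a mean $F_{0.5}$ score can be computed. It uses the standard F-beta definition, in which a beta below 1 weights precision more heavily than recall. The function names `f_beta` and `mean_f_beta`, the example counts, and the choice of an unweighted average over classes are illustrative assumptions, not details taken from the paper.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Standard F-beta score from true-positive, false-positive and
    false-negative counts. With beta < 1, precision counts more than recall."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Convention assumed here: score 0.0 when a class has no detections
    # and no ground-truth instances.
    return (1 + b2) * tp / denom if denom else 0.0


def mean_f_beta(per_class_counts, beta=0.5):
    """Unweighted mean of per-class F-beta scores (assumed averaging scheme).

    per_class_counts: iterable of (tp, fp, fn) tuples, one per class.
    """
    scores = [f_beta(tp, fp, fn, beta) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)


# Hypothetical counts for three classes, for illustration only.
example_counts = [(8, 2, 8), (5, 0, 5), (0, 3, 4)]
example_mean = mean_f_beta(example_counts)
```

Because false positives are penalized more than misses under $F_{0.5}$, a detector tuned for this metric favors precision, which matches the emphasis on suppressing false positives in continuous soundscape data.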
