Semi-supervised classification of bird vocalizations

Simen Hexeberg, Mandar Chitre, Matthias Hoffmann-Kuhnt, Bing Wen Low

TL;DR

This work tackles scalable, long-term monitoring of bird communities through passive acoustics under limited labeled data and dense soundscapes. It introduces a semi-supervised pipeline that combines segmentation, a convolutional auto-encoder, contrastive representation learning, and a supervised classifier, enabling detection of time-overlapping calls when they are separable in frequency. On held-out test data, the method achieves a mean $F_{0.5}$ of $0.701$ across 315 classes from 110 species, with an average of 11 labeled training samples per class. It outperforms BirdNET on a 103-species test set with far fewer labeled samples, and is further tested on 144 microphone-hours of continuous Singapore soundscape data, where it shows that high precision is achievable. The approach reduces labeling burden, supports efficient clustering and annotation of new classes, and is applicable to broader acoustic tasks involving frequency-modulated signals in complex environments.

Abstract

Changes in bird populations can indicate broader changes in ecosystems, making birds one of the most important animal groups to monitor. Combining machine learning and passive acoustics enables continuous monitoring over extended periods without direct human involvement. However, most existing techniques require extensive expert-labeled datasets for training and cannot easily detect time-overlapping calls in busy soundscapes. We propose a semi-supervised acoustic bird detector designed to allow both the detection of time-overlapping calls (when separated in frequency) and the use of few labeled training samples. The classifier is trained and evaluated on a combination of community-recorded open-source data and long-duration soundscape recordings from Singapore. It achieves a mean $F_{0.5}$ score of $0.701$ across 315 classes from 110 bird species on a held-out test set, with an average of 11 labeled training samples per class. It outperforms the state-of-the-art BirdNET classifier on a test set of 103 bird species despite significantly fewer labeled training samples. The detector is further tested on 144 microphone-hours of continuous soundscape data. The rich soundscape in Singapore makes suppression of false positives a challenge on raw, continuous data streams. Nevertheless, we demonstrate that achieving high precision in such environments with minimal labeled training data is possible.
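
The reported metric weights precision more heavily than recall, which fits the stated concern about false positives in continuous data streams. As a minimal sketch, the standard $F_\beta$ definition with $\beta = 0.5$ can be computed from per-class counts as below. The function names and the example counts are hypothetical, and averaging per-class scores with equal weight is an assumption; the abstract does not specify how the mean is formed.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Standard F-beta from true-positive, false-positive and false-negative counts.

    beta < 1 weights precision more than recall; beta = 1 gives the usual F1.
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Convention (assumed here): a class with no detections and no labels scores 0.
    return (1 + b2) * tp / denom if denom else 0.0


def mean_f_beta(counts, beta: float = 0.5) -> float:
    """Unweighted mean of per-class F-beta scores (assumed averaging scheme)."""
    scores = [f_beta(tp, fp, fn, beta) for tp, fp, fn in counts]
    return sum(scores) / len(scores)


# Hypothetical counts (tp, fp, fn) for two classes, for illustration only.
example_counts = [(8, 2, 8), (5, 0, 5)]
example_mean = mean_f_beta(example_counts)
```

For a class with precision 0.8 and recall 0.5, such as the first hypothetical class above, $F_{0.5}$ is higher than $F_1$, because the precision term dominates.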

Paper Structure

This paper contains 17 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Left: first deployment at the SBG (site #1) from July 4, 2020 until September 20, 2020. The approximate locations of the recording units are marked as yellow circles, with the corresponding external microphones of each unit marked as orange circles. The 6 microphones surround a lake and cover an area of roughly 50 m $\times$ 50 m. Right: a section of the elevated boardwalk used for the second deployment at the SBG (site #2) from September 20, 2020 until February 1, 2021. The same 6 microphones from site #1 were deployed along the circular boardwalk in a similar constellation, but without direct line of sight between units due to the dense vegetation. Photos from Google Maps.
  • Figure 2: An example illustrating extraction of TFRs from audio recordings. The left panel shows the spectrogram from a 5 second audio clip, with a few detected sounds enclosed by white rectangles. The right panels show the respective extracted TFRs. This example is non-exhaustive, i.e., not all detections in the audio clip are shown here.
  • Figure 3: Architecture of the convolutional auto-encoder. The network enables a $64 \times$ data compression by learning a latent representation of the TFRs which retains most of the information.
  • Figure 4: Examples of calls from (a): Crimson Sunbird (Aethopyga siparaja, 558466), (b): Common Hill Myna (Gracula religiosa, 179652), (c): Olive-winged Bulbul (Pycnonotus plumosus, 562623) and (d): Lineated Barbet (Psilopogon lineatus, 1145226). The top row shows the TFRs after extraction from raw audio recordings, and the bottom row shows the compressed TFRs after passing through the auto-encoder. The high similarity between each pair shows that the compressed latent representation is capable of retaining most of the information in the TFRs. To keep sounds from different sources from merging, TFRs do not capture entire calls/songs if the pause between subsequent vocalizations is too long. The Olive-winged Bulbul in column (c) is one such example, where only a part of a longer call sequence is captured.
  • Figure 5: Architecture of the contrastive learning neural network. The $\tanh(\cdot)$ activation function in the last dense layer bounds every entry of the 1024-dimensional embedding space representation to the interval $(-1, 1)$, and the final normalization layer scales each vector to unit length, so that the embedding space can be thought of as the surface of a hypersphere of unit radius. The similarity between embedding space representations can then be measured as the dot product of the corresponding vectors. The distance between them can be measured as the angle between the vectors or, equivalently, as the distance along the surface of the hypersphere.
  • ...and 2 more figures
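
The embedding geometry described in the Figure 5 caption can be illustrated with a small NumPy sketch: a $\tanh$ activation followed by normalization to unit length, with similarity taken as a dot product and distance as an angle. This is not the authors' network. The function names are hypothetical, the random inputs stand in for the output of the last dense layer, and using the Euclidean (L2) norm for the normalization layer is an assumption consistent with the unit-hypersphere description.

```python
import numpy as np


def embed(pre_activations: np.ndarray) -> np.ndarray:
    """Apply tanh, then scale to unit L2 norm (a point on the unit hypersphere)."""
    z = np.tanh(pre_activations)  # squashes each entry into (-1, 1)
    norm = np.linalg.norm(z, axis=-1, keepdims=True)
    return z / np.maximum(norm, 1e-12)  # guard against division by zero


def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of unit vectors, i.e. the cosine of the angle between them."""
    return float(np.dot(a, b))


def angular_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Angle in radians between unit vectors; equals arc length on the unit sphere."""
    return float(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))


# Random stand-ins for the 1024-dimensional output of the last dense layer.
rng = np.random.default_rng(0)
x = embed(rng.normal(size=1024))
y = embed(rng.normal(size=1024))

sim_xy = similarity(x, y)
dist_xy = angular_distance(x, y)
```

Because every embedding has unit length, ranking neighbors by largest dot product and ranking them by smallest angle give the same order, which is why the two measures can be used interchangeably for clustering.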