Self-Supervised Learning for Few-Shot Bird Sound Classification
Ilyass Moummad, Romain Serizel, Nicolas Farrugia
TL;DR
This work shows that self-supervised learning (SSL) can extract meaningful, generalizable representations from unlabeled bird sounds, enabling effective few-shot classification of unseen species. The authors evaluate multiple SSL approaches (SimCLR, Barlow Twins, FroSSL) and a supervised baseline (SupCon) on the BirdCLEF-derived MetaAudio split, and find that SSL can outperform a CNN-based inference baseline and, in some cases, approach supervised performance. A key finding is that window selection, which uses a pretrained audio network (PANN/CNN14) to pick segments with high bird activation, significantly improves representation quality and downstream few-shot accuracy. Combining simple domain-agnostic augmentations with this targeted window selection yields robust bird-sound representations, with practical implications for scalable bioacoustic monitoring and species discovery. Future work aims to quantify representation quality without relying on validation-set performance.
Abstract
Self-supervised learning (SSL) in audio holds significant potential across various domains, particularly where abundant unlabeled data is readily available at no cost. This is pertinent in bioacoustics, where biologists routinely collect extensive sound datasets from the natural environment. In this study, we demonstrate that SSL can learn meaningful representations of bird sounds from audio recordings without annotations. Our experiments show that these learned representations generalize to new bird species in few-shot learning (FSL) scenarios. Additionally, we show that using a pretrained audio neural network to select windows with high bird activation for self-supervised learning significantly enhances the quality of the learned representations.
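
The window-selection idea described above can be illustrated with a minimal sketch: slide a fixed-length window over a recording, score each window, and keep the highest-scoring one for SSL training. This is not the authors' implementation. The function name `select_window`, the window and hop lengths, and the `score_fn` argument are assumptions made for illustration. In the paper the score comes from a pretrained audio network's bird activation; here `rms_energy` is only a simple stand-in so the example runs without any model.

```python
import numpy as np


def select_window(waveform, sample_rate, window_seconds, hop_seconds, score_fn):
    """Return (start_sample, window) for the highest-scoring fixed-length window.

    score_fn is a hypothetical callable mapping a 1-D window to a float;
    in practice it would wrap a pretrained network's bird-class activation.
    """
    win = int(round(window_seconds * sample_rate))
    hop = int(round(hop_seconds * sample_rate))

    # Recordings shorter than one window are zero-padded to the window length.
    if len(waveform) <= win:
        return 0, np.pad(waveform, (0, win - len(waveform)))

    # Candidate start positions, spaced by the hop length.
    starts = np.arange(0, len(waveform) - win + 1, hop)
    scores = np.array([score_fn(waveform[s:s + win]) for s in starts])

    # Keep the window with the highest score (first one on ties).
    best = int(starts[int(np.argmax(scores))])
    return best, waveform[best:best + win]


def rms_energy(window):
    """Stand-in scorer (assumption): root-mean-square energy of the window."""
    return float(np.sqrt(np.mean(window ** 2)))


# Toy recording: 10 s of silence at 100 Hz with a 1 s burst starting at 4 s.
recording = np.zeros(1000)
recording[400:500] = 1.0
start, window = select_window(recording, 100, 1.0, 0.5, rms_energy)
```

Swapping `rms_energy` for a function that returns a pretrained model's bird activation on the window would bring the sketch closer to the approach the abstract describes; the selection logic itself stays the same.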
