Comparison of self-supervised in-domain and supervised out-domain transfer learning for bird species recognition
Houtan Ghaffari, Paul Devos
TL;DR
The paper investigates whether in-domain self-supervised pretraining offers advantages over ImageNet-supervised pretraining for bird species recognition when labeled data are limited. It pretrains a ResNeXt-50 backbone with VICReg-based self-supervised learning (SSL) on BirdCLEF2021 and evaluates downstream performance on the WMWB bird dataset, comparing against ImageNet-pretrained and randomly initialized baselines with 1% and 10% of the labeled data. The results show that in-domain SSL pretraining significantly outperforms the ImageNet baselines, with the fine-tuned VICReg-full model achieving the best scores; the gains persist even with very small label sets. The work highlights the practicality of domain-specific SSL in bioacoustics and suggests that larger, domain-focused SSL models could broadly improve data-efficient audio recognition tasks.
Abstract
Transferring the weights of a pre-trained model to assist another task has become a crucial part of modern deep learning, particularly in data-scarce scenarios. Pre-training refers to the initial step of training models outside the current task of interest, typically on another dataset. It can be done via supervised models using human-annotated datasets or self-supervised models trained on unlabeled datasets. In both cases, many pre-trained models are available to fine-tune for the task of interest. Interestingly, research has shown that models pre-trained on ImageNet can be helpful for audio tasks despite being trained on image datasets. Hence, it is unclear whether in-domain models would be advantageous compared to competent out-domain models, such as convolutional neural networks pre-trained on ImageNet. Our experiments demonstrate the usefulness of in-domain models and datasets for bird species recognition by leveraging VICReg, a recent and powerful self-supervised method.
