Generalization in birdsong classification: impact of transfer learning methods and dataset characteristics

Burooj Ghani, Vincent J. Kalkman, Bob Planqué, Willem-Pier Vellinga, Lisa Gill, Dan Stowell

TL;DR

This study evaluates transfer learning strategies for birdsong classification, comparing deep fine-tuning, shallow fine-tuning, and knowledge distillation (including cross-model distillation) across CNN and Transformer architectures on European bird data. Using the Xeno-canto and Dawn Chorus datasets, it shows that cross-model distillation yields strong in-domain performance, while shallow fine-tuning often generalizes best to diverse soundscapes. Labeling practices, particularly the inclusion of background labels and temporal details, significantly influence performance: secondary labels boost recall and AUROC but sometimes reduce precision. The results guide practical reuse of pretrained models for automatic bioacoustic recognition and motivate improvements in data labeling to make biodiversity monitoring more robust.

Abstract

Animal sounds can be recognised automatically by machine learning, and this has an important role to play in biodiversity monitoring. Yet despite increasingly impressive capabilities, bioacoustic species classifiers still exhibit imbalanced performance across species and habitats, especially in complex soundscapes. In this study, we explore the effectiveness of transfer learning in large-scale bird sound classification across various conditions, including single- and multi-label scenarios, and across different model architectures such as CNNs and Transformers. Our experiments demonstrate that both fine-tuning and knowledge distillation yield strong performance, with cross-distillation proving particularly effective in improving in-domain performance on Xeno-canto data. However, when generalizing to soundscapes, shallow fine-tuning exhibits superior performance compared to knowledge distillation, highlighting its robustness and constrained nature. Our study further investigates how to use multi-species labels, in cases where these are present but incomplete. We advocate for more comprehensive labeling practices within the animal sound community, including annotating background species and providing temporal details, to enhance the training of robust bird sound classifiers. These findings provide insights into the optimal reuse of pretrained models for advancing automatic bioacoustic recognition.

Paper Structure

This paper contains 23 sections, 3 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Transfer learning strategies. Light-coloured blocks are neural networks being trained; dark-coloured blocks are 'frozen' and unchanging during transfer learning. Shallow fine-tuning (top) uses most of the pretrained model as a fixed feature extractor, retraining the final layer(s) on the new dataset. Deep fine-tuning (middle) retrains all layers. Knowledge distillation (lower) differs from both of these, training a new model to produce the same outputs as the teacher model, and also to match ground-truth labels when they are available.
  • Figure 2: Geographic distribution of our data sourced from Xeno-canto.
  • Figure 3: Histogram depicting the number of labeled species within 3-second chunks from the Dawn Chorus dataset, illustrating the degree of polyphony present.
  • Figure 4: (Left) Distribution of the number of recordings per species in the Xeno-canto dataset for the multi-label case. The y-axis represents the count of recordings per species. The mark within the box indicates the median value, and the box itself represents the interquartile range (middle 50% of the species). (Right) Distribution of the total number of species annotated per file in our Xeno-canto data. The large peak at 1 indicates that many sound files are annotated as having no other species audible in the background.
  • Figure 5: Performance of different models evaluated on the held-out test dataset from Xeno-canto. (a) Table showing various metrics including mAP, AUROC, epochs, time per epoch, and total training time. (b) Species-wise distribution of scores using the approaches listed in (a). Entries are bold-faced if the model scored the highest.
  • ...and 2 more figures
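
The three strategies described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`trainable_layers`, `distillation_loss`), the weighting parameter `alpha`, and the use of sigmoid outputs with binary cross-entropy are all assumptions made for illustration. The sketch shows only which layers would be updated under shallow versus deep fine-tuning, and how a distillation objective could combine a term that matches the teacher's outputs with a term that matches ground-truth labels when they are available.

```python
import math


def trainable_layers(layer_names, strategy, n_head_layers=1):
    """Return the layers that are updated during transfer learning.

    'shallow' keeps everything except the last n_head_layers frozen;
    'deep' retrains every layer. Names and arguments are hypothetical.
    """
    if n_head_layers < 1:
        raise ValueError("n_head_layers must be at least 1")
    if strategy == "deep":
        return list(layer_names)
    if strategy == "shallow":
        return list(layer_names[-n_head_layers:])
    raise ValueError(f"unknown strategy: {strategy}")


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy; targets may be soft values in [0, 1]."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(probs)


def distillation_loss(student_logits, teacher_probs, labels=None, alpha=0.5):
    """Student loss for one clip (assumed form, not taken from the paper).

    The soft term pulls the student towards the teacher's per-species
    outputs. If ground-truth labels exist, a hard term is mixed in with
    weight alpha (an assumed hyperparameter).
    """
    student_probs = [sigmoid(z) for z in student_logits]
    soft = binary_cross_entropy(student_probs, teacher_probs)
    if labels is None:
        return soft
    hard = binary_cross_entropy(student_probs, labels)
    return alpha * hard + (1.0 - alpha) * soft


if __name__ == "__main__":
    layers = ["conv_block_1", "conv_block_2", "classifier_head"]
    print("shallow:", trainable_layers(layers, "shallow"))
    print("deep:   ", trainable_layers(layers, "deep"))
    # Two-species example: teacher is confident about species 0 only.
    print("loss:   ", distillation_loss([2.0, -1.0], [0.9, 0.1], labels=[1.0, 0.0]))
```

In a real training setup the same idea would typically be expressed by freezing parameters in a deep learning framework and back-propagating through the combined loss; the pure-Python version above only makes the structure of the three strategies explicit.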