Transfer Learning with Semi-Supervised Dataset Annotation for Birdcall Classification

Anthony Miyaguchi; Nathan Zhong; Murilo Gustineli; Chris Hayduk

TL;DR

The paper tackles the challenge of labeling long African soundscapes for numerous bird species by leveraging transfer learning through BirdNET-derived embeddings ($320$-dimensional) and semi-supervised dataset annotation via MixIT. It outlines a full pipeline, including source separation, embedding extraction, and annotated dataset construction, and evaluates a range of modeling approaches from logistic regression to ensemble methods, highlighting the baseline's competitive performance. The findings show that a simple embedding-space classifier can be remarkably effective, while more complex feature engineering yields mixed results, underscoring the importance of data quality and representation. The work demonstrates a scalable, semi-supervised framework for birdcall classification with potential applicability to other bioacoustic tasks and species-rich domains.

Abstract

We present working notes on transfer learning with semi-supervised dataset annotation for the BirdCLEF 2023 competition, focused on identifying African bird species in recorded soundscapes. Our approach utilizes existing off-the-shelf models, BirdNET and MixIT, to address representation and labeling challenges in the competition. We explore the embedding space learned by BirdNET and propose a process to derive an annotated dataset for supervised learning. Our experiments involve various models and feature engineering approaches to maximize performance on the competition leaderboard. The results demonstrate the effectiveness of our approach in classifying bird species and highlight the potential of transfer learning and semi-supervised dataset annotation in similar tasks.

Paper Structure (16 sections, 1 equation, 4 figures, 4 tables)

Introduction
Embedding Space and Transfer Learning
Semi-Supervised Dataset Annotation
Implementation and Workflow
Experiments
Baseline Model
Baseline Binary No-call Model
Interpolated Embedding Models
Concatenated Embedding Model
Ensemble Embedding Model
Probability Logit Model
Discussion
Semi-Supervised Annotation Quality
Audio Source Separation
Embedding Space and Transfer Learning
...and 1 more sections

Figures (4)

Figure 1: We demonstrate the clustering properties of the BirdNET embeddings by projecting them into $\mathbb{R}^{2}$ via UMAP (McInnes et al., 2020). The projection preserves Euclidean distance in 2D space. For each track, we take the embedding whose BirdNET prediction vector contains the highest probability and assign it a positive label. The left plot shows clustering across the seven most common species in the training dataset. The right plot demonstrates a clear separation between the seven most common species.
Figure 2: Bird-MixIT has improved the precision of downstream classifiers in experimental settings. We demonstrate separation across entities in the mel-spectrogram of track XC207767, which contains a Red-chested Cuckoo. Three sound signatures are separated into sources 0, 1, and 3. Source 2 is an amalgamation of the other sound signatures, an artifact of the model separating into four channels. An automated process should choose source 3, which contains the species of interest.
Figure 3: We use Luigi to coordinate a processing pipeline that runs for days on an n2-standard-16 compute instance. We prevent processing skew across workers by splitting the audio into chunks. Each chunk is then source separated and embedded, resulting in one parquet file per audio chunk. We consolidate the parquet files into the final dataset.
Figure 4: We demonstrate the clustering properties of the binary no-call embeddings by projecting them into $\mathbb{R}^{2}$ via UMAP. The plots compare clustering of the binary dataset over all species against the top-3 species.
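
The selection rules described in the captions of Figures 1 and 2 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Chunk` layout and the function names `label_track` and `pick_source` are hypothetical, and the rule used in `pick_source` (take the separated source with the highest probability for the track's known species) is an assumed stand-in for the automated process that Figure 2 calls for.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: one entry per audio chunk of a track, holding the
# chunk's embedding vector and its per-species probability vector.
Chunk = Tuple[Sequence[float], Sequence[float]]


def label_track(chunks: List[Chunk]) -> Tuple[Sequence[float], int, float]:
    """Return (embedding, species_index, probability) for the chunk whose
    prediction vector holds the highest probability in the whole track."""
    if not chunks:
        raise ValueError("track has no chunks")
    best_embedding, best_species, best_prob = None, -1, float("-inf")
    for embedding, probs in chunks:
        # Most likely species within this chunk.
        species = max(range(len(probs)), key=probs.__getitem__)
        if probs[species] > best_prob:
            best_embedding, best_species, best_prob = embedding, species, probs[species]
    return best_embedding, best_species, best_prob


def pick_source(source_probs: List[Sequence[float]], target_species: int) -> int:
    """Return the index of the separated source with the highest probability
    for the target species (assumed selection rule, see the note above)."""
    if not source_probs:
        raise ValueError("no separated sources")
    return max(range(len(source_probs)), key=lambda i: source_probs[i][target_species])


# Toy data: three chunks with 2-dimensional embeddings and three species.
track = [
    ([0.1, 0.2], [0.10, 0.30, 0.05]),
    ([0.7, 0.9], [0.05, 0.10, 0.92]),
    ([0.4, 0.4], [0.60, 0.20, 0.10]),
]
embedding, species, prob = label_track(track)

# Four separated sources, two species; the target species has index 1.
chosen = pick_source([[0.2, 0.1], [0.1, 0.05], [0.3, 0.2], [0.1, 0.8]], 1)
```

In the toy data, the second chunk is selected and labeled with species index 2, and the last of the four sources is chosen. Real embeddings would be 320-dimensional, and ties resolve to the first candidate encountered.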
