Perch 2.0: The Bittern Lesson for Bioacoustics
Bart van Merriënboer, Vincent Dumoulin, Jenny Hamer, Lauren Harrell, Andrea Burns, Tom Denton
TL;DR
Perch 2.0 tackles cross-taxa bioacoustic transfer by expanding supervised pretraining to a large multi-taxa labeled corpus and by training with self-distillation, using a prototype-learning head and a source-prediction objective. The model uses a compact EfficientNet-B3 backbone, a three-head output design, and a two-phase training regime with multi-component mixup to produce embeddings that transfer well under linear probing and retrieval. It achieves state-of-the-art results on BirdSet and BEANS without fine-tuning the embedding model and demonstrates strong marine transfer, underscoring its practical value for conservation and biodiversity monitoring. The work argues that fine-grained supervised labels, combined with domain-aware augmentations and auxiliary tasks, yield robust, scalable representations that apply across diverse bioacoustic domains.
Abstract
Perch is a performant pre-trained model for bioacoustics. It was trained in a supervised fashion, providing both off-the-shelf classification scores for thousands of vocalizing species and strong embeddings for transfer learning. In this new release, Perch 2.0, we expand from training exclusively on avian species to a large multi-taxa dataset. The model is trained with self-distillation using a prototype-learning classifier as well as a new source-prediction training criterion. Perch 2.0 obtains state-of-the-art performance on the BirdSet and BEANS benchmarks. It also outperforms specialized marine models on marine transfer learning tasks, despite having almost no marine training data. We present hypotheses as to why fine-grained species classification is a particularly robust pre-training task for bioacoustics.
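To make the transfer-learning setting concrete, the following is a minimal sketch of linear probing on frozen embeddings: the embedding model is left untouched and only a small linear classifier is fit on top. Everything here is an assumption for illustration. The random vectors stand in for model embeddings (no Perch API is called), and the function names `fit_linear_probe` and `predict`, the L2 normalization, the embedding size, and the optimizer settings are hypothetical choices, not the authors' evaluation protocol.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length (an assumed preprocessing choice)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def fit_linear_probe(embeddings, labels, num_classes, lr=1.0, steps=500):
    """Fit softmax regression on fixed embeddings with plain gradient descent.

    Only `weights` and `bias` are learned; the embeddings are never updated.
    """
    n, dim = embeddings.shape
    weights = np.zeros((dim, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = embeddings @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad_logits = (probs - one_hot) / n  # gradient of mean cross-entropy
        weights -= lr * (embeddings.T @ grad_logits)
        bias -= lr * grad_logits.sum(axis=0)
    return weights, bias


def predict(embeddings, weights, bias):
    """Return the highest-scoring class index for each embedding."""
    return np.argmax(embeddings @ weights + bias, axis=1)


# Synthetic stand-in for frozen embeddings: one Gaussian cluster per class.
rng = np.random.default_rng(0)
num_classes, dim, per_class = 3, 16, 40
centers = rng.normal(scale=3.0, size=(num_classes, dim))
labels = np.repeat(np.arange(num_classes), per_class)

train_x = l2_normalize(centers[labels] + rng.normal(size=(labels.size, dim)))
test_x = l2_normalize(centers[labels] + rng.normal(size=(labels.size, dim)))

weights, bias = fit_linear_probe(train_x, labels, num_classes)
test_accuracy = float(np.mean(predict(test_x, weights, bias) == labels))
print(f"held-out accuracy: {test_accuracy:.2f}")
```

In practice the synthetic arrays would be replaced by embeddings computed once from labeled audio clips. Because the embedding model stays frozen, the probe can be refit for a new label set at the cost of a small optimization over a single weight matrix.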
