Crossing the Species Divide: Transfer Learning from Speech to Animal Sounds
Jules Cauzinille, Marius Miron, Olivier Pietquin, Masato Hagiwara, Ricard Marxer, Arnaud Rey, Benoit Favre
TL;DR
This work investigates whether self-supervised speech models can transfer to bioacoustic tasks across diverse species. It evaluates HuBERT, WavLM, and XEUS on the BEANS benchmark using linear probing and time-aware downstream setups, and examines the effects of noise, time information, and frequency range. The findings show that speech-based representations can achieve competitive bioacoustic performance, with noise-robust pretraining and temporal attention aiding transfer, while simple linear probes often outperform more complex recurrent models. The study underscores the potential of speech-based foundation models for data-limited bioacoustic research and outlines directions for future domain-tailored pretraining and evaluation.
Abstract
Self-supervised speech models have demonstrated impressive performance in speech processing, but their effectiveness on non-speech data remains underexplored. We study the transfer learning capabilities of such models on bioacoustic detection and classification tasks. We show that models such as HuBERT, WavLM, and XEUS can generate rich latent representations of animal sounds across taxa. We analyze the models' properties with linear probing on time-averaged representations. We then extend the approach with other downstream architectures to account for the effect of time-wise information. Finally, we study the effect of frequency range and noise on performance. Notably, our results are competitive with fine-tuned bioacoustic pre-trained models and show the impact of noise-robust pre-training setups. These findings highlight the potential of speech-based self-supervised learning as an efficient framework for advancing bioacoustic research.
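To make "linear probing on time-averaged representations" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a frozen encoder yields one feature vector per audio frame; here the hypothetical helper `fake_frames` stands in for such an encoder with synthetic data, so no speech model or BEANS data is involved. The frame vectors are averaged over time and a single linear layer (softmax regression) is trained on the pooled vectors. All function names, dimensions, and hyperparameters are illustrative assumptions.

```python
import math
import random


def mean_pool(frames):
    """Average a list of T frame vectors (each of length D) over time."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scores(W, b, x):
    """Linear layer: one score per class."""
    return [sum(w * xi for w, xi in zip(row, x)) + bc for row, bc in zip(W, b)]


def train_linear_probe(features, labels, n_classes, lr=0.5, epochs=200):
    """Fit a linear classifier on fixed features with batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, y in zip(features, labels):
            probs = softmax(scores(W, b, x))
            for c in range(n_classes):
                # Gradient of cross-entropy w.r.t. the class-c score.
                err = probs[c] - (1.0 if c == y else 0.0)
                gb[c] += err
                for d in range(dim):
                    gW[c][d] += err * x[d]
        for c in range(n_classes):
            b[c] -= lr * gb[c] / n
            for d in range(dim):
                W[c][d] -= lr * gW[c][d] / n
    return W, b


def predict(W, b, x):
    """Return the index of the highest-scoring class."""
    s = scores(W, b, x)
    return max(range(len(s)), key=s.__getitem__)


def fake_frames(label, rng, dim=4):
    """Hypothetical stand-in for a frozen encoder: noisy frames of varying length."""
    n_frames = rng.randint(20, 50)
    return [
        [(1.0 if d == label else 0.0) + rng.gauss(0.0, 0.3) for d in range(dim)]
        for _ in range(n_frames)
    ]


rng = random.Random(0)
labels = [i % 2 for i in range(40)]
# Only the pooled vectors reach the probe; the "encoder" is never updated.
features = [mean_pool(fake_frames(y, rng)) for y in labels]
W, b = train_linear_probe(features, labels, n_classes=2)
accuracy = sum(predict(W, b, x) == y for x, y in zip(features, labels)) / len(labels)
print(f"training accuracy of the linear probe: {accuracy:.2f}")
```

Because the encoder is frozen and pooling discards frame order, the probe measures how much class information is linearly readable from the averaged representation alone. The time-aware downstream architectures mentioned in the abstract would instead consume the full frame sequence.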
