Crossing the Species Divide: Transfer Learning from Speech to Animal Sounds

Jules Cauzinille, Marius Miron, Olivier Pietquin, Masato Hagiwara, Ricard Marxer, Arnaud Rey, Benoit Favre

TL;DR

This work investigates whether self-supervised speech models can transfer to bioacoustic tasks across diverse species. It evaluates HuBERT, WavLM, and XEUS on the BEANS benchmark using linear probing and time-aware downstream setups, and examines the effects of noise, temporal information, and frequency range. The findings show that speech-based representations can achieve competitive bioacoustic performance, with noise-robust pretraining and temporal attention aiding transfer, while simple linear probes often outperform more complex recurrent models. The study underscores the potential of speech-based foundation models for data-limited bioacoustic research and outlines directions for future domain-tailored pretraining and evaluation.

Abstract

Self-supervised speech models have demonstrated impressive performance in speech processing, but their effectiveness on non-speech data remains underexplored. We study the transfer learning capabilities of such models on bioacoustic detection and classification tasks. We show that models such as HuBERT, WavLM, and XEUS can generate rich latent representations of animal sounds across taxa. We analyze the models' properties with linear probing on time-averaged representations. We then extend the approach to account for the effect of time-wise information with other downstream architectures. Finally, we study the effect of frequency range and noise on performance. Notably, our results are competitive with fine-tuned bioacoustic pre-trained models and show the impact of noise-robust pre-training setups. These findings highlight the potential of speech-based self-supervised learning as an efficient framework for advancing bioacoustic research.
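
To make "linear probing on time-averaged representations" concrete, the following is a minimal sketch and not the authors' implementation. It assumes each clip has already been encoded by a frozen model into a matrix of frame-level features of shape (T, D); here those features are replaced by synthetic random data, and the function names (`time_average`, `train_linear_probe`, `predict`), the softmax-regression probe, and all hyperparameters are illustrative choices, not details taken from the paper.

```python
import numpy as np

def time_average(frames):
    """Mean-pool frame-level features of shape (T, D) into a single (D,) vector."""
    return frames.mean(axis=0)

def train_linear_probe(X, y, n_classes, lr=0.1, epochs=500):
    """Fit a softmax linear classifier on fixed (frozen) features X of shape (N, D)."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                           # one-hot targets, shape (N, C)
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)              # softmax probabilities
        G = (P - Y) / n                                # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def predict(X, W, b):
    """Return the predicted class index for each row of X."""
    return (X @ W + b).argmax(axis=1)

# Synthetic stand-in for frozen-encoder outputs (an assumption for illustration):
# each clip has a variable number of frames drawn around a class-specific mean.
rng = np.random.default_rng(0)
n_clips, n_classes, dim = 200, 4, 16
labels = rng.integers(0, n_classes, size=n_clips)
class_means = rng.normal(0.0, 1.0, size=(n_classes, dim))
clips = [
    class_means[c] + rng.normal(0.0, 1.0, size=(int(rng.integers(20, 60)), dim))
    for c in labels
]

# Only the probe is trained; the per-clip features stay fixed.
X = np.stack([time_average(f) for f in clips])         # shape (n_clips, dim)
W, b = train_linear_probe(X[:150], labels[:150], n_classes)
accuracy = (predict(X[150:], W, b) == labels[150:]).mean()
print(f"held-out accuracy: {accuracy:.2f}")
```

Because averaging over time discards the order of frames, a probe like this measures only what is linearly recoverable from the pooled representation, which is why the abstract goes on to consider downstream architectures that use time-wise information.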

Paper Structure

This paper contains 14 sections, 3 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Workflow of the transfer learning method
  • Figure 2: Performance for the Egyptian fruit bats dataset on the 10th layer with pitch shifting (T-A)
  • Figure 3: Performance for the Egyptian fruit bats dataset on the 10th layer with noise addition (T-A)