Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing
Eklavya Sarkar, Mathew Magimai.-Doss
TL;DR
This work evaluates whether self-supervised learning (SSL) models pre-trained on animal vocalizations outperform speech-pretrained SSL models on bioacoustic tasks, and whether fine-tuning on automatic speech recognition (ASR) adds benefits. Using three diverse datasets, multiple SSL architectures, and a layer-wise analysis with a simple MLP classifier, the study contrasts pre-training domains and fine-tuning effects, reporting results as unweighted average recall (UAR). The findings indicate that speech-pretrained models are largely robust for bioacoustics, with only marginal gains from bioacoustic pre-training in some cases, and that ASR fine-tuning does not consistently improve performance. Overall, the paper underscores the transferability of speech SSL representations to bioacoustics and questions the necessity of extensive fine-tuning for these tasks, while pointing to attention-based avenues for future work.
Abstract
Self-supervised learning (SSL) foundation models have emerged as powerful, domain-agnostic, general-purpose feature extractors applicable to a wide range of tasks. Such models pre-trained on human speech have demonstrated high transferability for bioacoustic processing. This paper investigates (i) whether SSL models pre-trained directly on animal vocalizations offer a significant advantage over those pre-trained on speech, and (ii) whether fine-tuning speech-pretrained models on automatic speech recognition (ASR) tasks can enhance bioacoustic classification. We conduct a comparative analysis using three diverse bioacoustic datasets and two different bioacoustic tasks. Results indicate that pre-training on bioacoustic data provides only marginal improvements over speech-pretrained models, with comparable performance in most scenarios. Fine-tuning on ASR tasks yields mixed outcomes, suggesting that the general-purpose representations learned during SSL pre-training are already well-suited for bioacoustic tasks. These findings highlight the robustness of speech-pretrained SSL models for bioacoustics and imply that extensive fine-tuning may not be necessary for optimal performance.
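The TL;DR notes that results are reported via UAR, which is commonly expanded as unweighted average recall: the mean of the per-class recalls, so that rare call types or individuals count as much as frequent ones. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function name and the toy labels are assumptions made for illustration.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class counts equally, whatever its size."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # number of correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels: a classifier that always predicts "a".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
uar = unweighted_average_recall(y_true, y_pred)
print(accuracy, uar)
```

In this toy case the majority-class predictor reaches 0.8 accuracy but only 0.5 UAR, which illustrates why a class-balanced metric is a sensible choice for imbalanced bioacoustic datasets.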
