Towards Dog Bark Decoding: Leveraging Human Speech Processing for Automated Bark Classification
Artem Abzaliev, Humberto Pérez Espinosa, Rada Mihalcea
TL;DR
This work investigates whether self-supervised representations learned from human speech can be leveraged to analyze dog vocalizations. The authors fine-tune Wav2Vec2 on a dedicated dataset of dog vocalizations and compare it against a model trained from scratch. They show that embedding-based features significantly outperform simpler baselines on four tasks: dog recognition, breed identification, gender classification, and context grounding. Starting from a model pre-trained on LibriSpeech yields additional gains on several tasks. The results point to promising cross-domain transfer from human speech processing to animal communication, and to the potential of speech-derived representations for recovering semantic information from animal vocalizations. The study also provides a publicly available dataset and baselines to spur further research in animal communication and cross-domain representation learning.
Abstract
Like humans, animals make extensive use of verbal and non-verbal forms of communication, including a large range of audio signals. In this paper, we focus on dog vocalizations and explore the use of self-supervised speech representation models pre-trained on human speech for dog bark classification tasks that parallel human-centered tasks in speech recognition. We specifically address four tasks: dog recognition, breed identification, gender classification, and context grounding. We show that using speech embedding representations significantly improves over simpler classification baselines. Further, we find that models pre-trained on large amounts of human speech can provide additional performance boosts on several tasks.
