Towards Dog Bark Decoding: Leveraging Human Speech Processing for Automated Bark Classification

Artem Abzaliev, Humberto Pérez Espinosa, Rada Mihalcea

TL;DR

This work investigates whether self-supervised representations learned from human speech can be leveraged to analyze dog vocalizations. By fine-tuning Wav2Vec2 on a dataset of dog vocalizations and comparing against a model trained from scratch, the authors show that embedding-based features significantly outperform baselines across dog recognition, breed identification, gender classification, and context grounding tasks, with additional gains when using a LibriSpeech-pretrained model. The results demonstrate promising cross-domain transfer from human speech processing to animal communication, highlighting the potential of NLP-inspired representations to unlock semantic information in animal vocalizations. The study also provides a publicly available dataset and baselines to spur further research in animal communication and cross-domain representation learning.

Abstract

Similar to humans, animals make extensive use of verbal and non-verbal forms of communication, including a large range of audio signals. In this paper, we address dog vocalizations and explore the use of self-supervised speech representation models pre-trained on human speech to address dog bark classification tasks that find parallels in human-centered tasks in speech recognition. We specifically address four tasks: dog recognition, breed identification, gender classification, and context grounding. We show that using speech embedding representations significantly improves over simpler classification baselines. Further, we also find that models pre-trained on large human speech acoustics can provide additional performance boosts on several tasks.

Paper Structure (16 sections, 1 figure, 5 tables)

This paper contains 16 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Data collection for the stimulus "playing with toy"; the owner stimulates the dog using the toys with which the dog normally plays.