Feature Representations for Automatic Meerkat Vocalization Classification

Imen Ben Mahmoud, Eklavya Sarkar, Marta Manser, Mathew Magimai.-Doss

TL;DR

This work addresses automatic meerkat vocalization classification by comparing several feature representations: knowledge-based hand-crafted features (Catch22, COMPARE, eGeMAPS), neural representations from self-supervised learning (SSL) models (WavLM, wav2vec2, HuBERT), and CNN-crafted features from an end-to-end approach. The authors evaluate these representations on two real-world datasets (Set A and Set B) using an SVM classifier in a 5-fold cross-validation framework, with Unweighted Average Recall (UAR) as the metric. CNN-crafted features yield the best performance, while hand-crafted features such as eGeMAPS and COMPARE remain competitive. Lower-layer SSL embeddings also outperform higher-layer ones, and the SSL results indicate that human-speech pretraining transfers effectively to meerkat calls. The findings show that diverse feature representations, spanning traditional signal processing, SSL embeddings, and task-specific CNN features, can effectively support automatic meerkat call classification, and they motivate further work on interpreting the acoustic cues involved.
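To make the evaluation metric concrete, here is a minimal sketch of Unweighted Average Recall: the recall of each class is computed separately and the per-class values are averaged with equal weight, so frequent call types cannot dominate the score. The function name, the call-type labels ("close", "alarm"), and the toy predictions are hypothetical illustrations, not data or code from the paper.

```python
from collections import Counter


def unweighted_average_recall(true_labels, predicted_labels):
    """Mean of per-class recalls, with every class weighted equally."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    support = Counter(true_labels)  # number of samples per true class
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    recalls = [hits[cls] / count for cls, count in support.items()]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced example: 8 "close" calls and 2 "alarm" calls,
# scored against a classifier that always predicts the majority class.
true_calls = ["close"] * 8 + ["alarm"] * 2
always_majority = ["close"] * 10

accuracy = sum(t == p for t, p in zip(true_calls, always_majority)) / len(true_calls)
uar = unweighted_average_recall(true_calls, always_majority)
print(accuracy, uar)  # 0.8 0.5
```

The majority-class predictor reaches 80% accuracy but only 50% UAR, which is why UAR is a common choice when call types are unevenly represented. In scikit-learn the same quantity is available as `recall_score(y_true, y_pred, average="macro")`.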

Abstract

Understanding the evolution of vocal communication in social animals is an important research problem. In that context, beyond humans, there is interest in analyzing the vocalizations of other social animals such as meerkats, marmosets, and apes. While existing approaches address the vocalizations of certain species, a reliable method tailored to meerkat calls is lacking. To that end, this paper investigates feature representations for automatic meerkat vocalization analysis. Both traditional signal processing-based representations and data-driven representations facilitated by advances in deep learning are explored. Call type classification studies conducted on two data sets reveal that feature extraction methods developed for human speech processing can be effectively employed for automatic meerkat call analysis.

Paper Structure

This paper contains 10 sections, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: Diagram of the workflow of the study. N denotes the number of frames.
  • Figure 2: Confusion matrices for SVM classifier using, from left to right, WavLM, CNN-crafted, COMPARE and Catch22 embeddings on the test set of Set A.
  • Figure 3: Confusion matrices for SVM classifier using, from left to right, WavLM, CNN-crafted, COMPARE and Catch22 embeddings on the test set of Set B.
  • Figure 4: Cumulative frequency responses of the first-layer filters of the CNN.