On the Utility of Speech and Audio Foundation Models for Marmoset Call Analysis

Eklavya Sarkar, Mathew Magimai.-Doss

TL;DR

This study examines whether foundation models pre-trained on human speech and on general audio transfer to a different domain, marmoset call analysis. It addresses two tasks, call-type (CTID) and caller identity (CLID) classification, across pre-training bandwidths of 4, 8, and 16 kHz. It compares four feature families (a hand-crafted Catch22 baseline, SSL models trained on speech, SSL models trained on general audio, and supervised general-audio models) on the InfantMarmosetsVox dataset, using both similarity and non-linear classification analyses. Results show that increasing bandwidth yields monotonic performance gains, and that pre-training on speech versus general audio yields comparable improvements over the spectral baseline, with BYOL-A and PANN achieving strong results depending on the task. The findings highlight the potential of cross-domain foundation models for bioacoustic analysis when the bandwidth aligns with the species’ vocal range, and suggest directions for interdisciplinary collaboration to interpret the biological implications.

Abstract

Marmoset monkeys encode vital information in their calls and serve as a surrogate model for neuro-biologists to understand the evolutionary origins of human vocal communication. Their calls have traditionally been analyzed with signal processing-based features; recent approaches have instead utilized self-supervised models pre-trained on human speech for feature extraction, capitalizing on their ability to learn a signal's intrinsic structure independently of its acoustic domain. However, the utility of such foundation models remains unclear for marmoset call analysis in terms of multi-class classification, bandwidth, and pre-training domain. This study assesses feature representations derived from speech and general audio domains, across pre-training bandwidths of 4, 8, and 16 kHz for marmoset call-type and caller classification tasks. Results show that models with higher bandwidth improve performance, and pre-training on speech or general audio yields comparable results, improving over a spectral baseline.

Paper Structure

This paper contains 8 sections, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Marmoset vocalizations with a 16 kHz bandwidth. Top: Spectrograms of a single call-type vocalization. Bottom: The mean spectrum for all vocalizations per call-type across the dataset, normalized. Shaded areas indicate $\pm$ 1 std from the mean spectrum.
  • Figure 2: Pairwise mean cosine distance matrices for features $\mathcal{F}$ at different bandwidths for call-types (CTID) and callers (CLID). Diagonal entries represent intra-class distances, and off-diagonal entries represent inter-class distances. Darker regions indicate higher similarity.
  • Figure 3: Distribution of pairwise cosine distances.
  • Figure 4: Normalized confusion matrices with row indices representing true class labels. Darker diagonals signify higher performance.
  • Figure 5: Layer-wise UAR scores of WavLM features, normalized per task. Darker regions indicate higher performance.
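
The two quantities named in the captions of Figures 2 and 5 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes features are already available as fixed-length vectors with one label per vector, that intra-class means exclude the zero-distance self-pairs, and that UAR follows its usual definition as the mean of per-class recalls. The function names `mean_cosine_distance_matrix` and `unweighted_average_recall` are hypothetical, and the toy data at the end is made up purely for illustration.

```python
import numpy as np


def mean_cosine_distance_matrix(features, labels):
    """Return (classes, D), where D[i, j] is the mean cosine distance
    between feature vectors of class i and feature vectors of class j.

    Assumption: no feature vector has zero norm.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # L2-normalise rows so that a dot product equals cosine similarity.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    distances = 1.0 - unit @ unit.T
    classes = np.unique(labels)
    matrix = np.zeros((len(classes), len(classes)))
    for i, a in enumerate(classes):
        for j, b in enumerate(classes):
            block = distances[np.ix_(labels == a, labels == b)]
            if i == j:
                # Assumption: drop self-pairs (distance 0) from the
                # intra-class mean; a single-sample class gets 0.
                n = block.shape[0]
                off_diag = ~np.eye(n, dtype=bool)
                matrix[i, j] = block[off_diag].mean() if n > 1 else 0.0
            else:
                matrix[i, j] = block.mean()
    return classes, matrix


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally
    regardless of how many samples it has."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


# Toy example with made-up 2-D "features" for two classes.
toy_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
toy_labels = [0, 0, 1, 1]
toy_classes, toy_matrix = mean_cosine_distance_matrix(toy_features, toy_labels)
toy_uar = unweighted_average_recall([0, 0, 0, 1], [0, 0, 0, 0])
```

In a matrix built this way, a small diagonal value relative to the off-diagonal values in the same row indicates that a class is compact and well separated in the feature space. UAR is a common choice when classes are imbalanced, because a classifier that always predicts the majority class scores only 1/K on a K-class task.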