Towards Leveraging Sequential Structure in Animal Vocalizations

Eklavya Sarkar, Mathew Magimai.-Doss

TL;DR

This work addresses the challenge of preserving temporal structure in animal vocalizations by converting HuBERT-derived embeddings into discrete per-frame tokens via vector quantization (VQ) and Gumbel-softmax VQ (GVQ). The authors evaluate whether the token sequences retain informative sequential information using Levenshtein distance-based analyses and $k$-NN classification across four bioacoustic datasets, comparing against a pooling-based linear baseline. Findings show that VQ tokens reflect call-type and caller information to some extent, with stronger discrimination for call-types, while GVQ is unstable (notably through codebook collapse) on several datasets. Overall, a single codebook falls short of the linear baseline, signaling the need for more sophisticated sequence modeling and multi-codebook approaches. The work highlights the potential and limitations of discrete audio tokens for bioacoustic sequence analysis and outlines concrete avenues for improving token-based representations and downstream tasks in animal communication research.
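As a rough illustration of the per-frame tokenization step, the sketch below assigns each frame embedding to its nearest codebook vector. It is a minimal example under stated assumptions, not the authors' implementation: the function name `quantize_frames`, the toy identity codebook, and the 4-dimensional "embeddings" are all hypothetical stand-ins for HuBERT features and a learned codebook, and codebook training is not shown.

```python
import numpy as np

def quantize_frames(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame embedding (T, D) to the index of its nearest code (K, D)."""
    # Squared Euclidean distance between every frame and every codebook vector.
    diffs = frames[:, None, :] - codebook[None, :, :]   # shape (T, K, D)
    dists = np.sum(diffs ** 2, axis=-1)                 # shape (T, K)
    return np.argmin(dists, axis=1)                     # shape (T,)

# Hypothetical toy codebook: K=4 codes in D=4 dimensions (not a learned codebook).
codebook = np.eye(4)
# Four "frames", each a slightly perturbed copy of one code.
frames = codebook[[2, 2, 0, 3]] + 0.05
tokens = quantize_frames(frames, codebook)
print(tokens.tolist())
```

Each vocalization then becomes a sequence of integer tokens whose order is preserved, which is what makes sequence-level comparisons possible.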

Abstract

Animal vocalizations contain sequential structures that carry important communicative information, yet most computational bioacoustics studies average the extracted frame-level features across the temporal axis, discarding the order of the sub-units within a vocalization. This paper investigates whether discrete acoustic token sequences, derived through vector quantization and Gumbel-softmax vector quantization of extracted self-supervised speech model representations, can effectively capture and leverage temporal information. To that end, pairwise distance analysis of token sequences generated from HuBERT embeddings shows that they can discriminate call-types and callers across four bioacoustics datasets. Sequence classification experiments using $k$-Nearest Neighbour with Levenshtein distance show that the vector-quantized token sequences yield reasonable call-type and caller classification performance, and hold promise as alternative feature representations towards leveraging sequential information in animal vocalizations.
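The classification setup described above can be sketched as a $k$-Nearest Neighbour classifier over token sequences with Levenshtein (edit) distance as the metric. This is a minimal sketch under assumptions: the helper names `levenshtein` and `knn_predict`, the majority-vote rule, the choice of `k=3`, and the toy sequences and labels `"A"`/`"B"` are illustrative and not taken from the paper.

```python
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two sequences; insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))          # distances from empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete x from a
                curr[j - 1] + 1,            # insert y into a
                prev[j - 1] + (x != y),     # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

def knn_predict(query, train_seqs, train_labels, k=3):
    """Majority vote among the k training sequences closest to the query."""
    ranked = sorted((levenshtein(query, s), i) for i, s in enumerate(train_seqs))
    votes = Counter(train_labels[i] for _, i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical token sequences for two made-up classes.
train_seqs = [[1, 1, 2, 3], [1, 2, 2, 3], [1, 1, 2, 2, 3],
              [7, 7, 5], [7, 5, 5], [7, 7, 5, 5]]
train_labels = ["A", "A", "A", "B", "B", "B"]
prediction = knn_predict([1, 1, 2, 3, 3], train_seqs, train_labels, k=3)
print(prediction)
```

Because edit distance handles sequences of different lengths, no pooling or padding is needed before comparison, which is the property that lets the token order contribute to the decision.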

Paper Structure

This paper contains 15 sections, 5 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Discrete call tokenization pipeline using vector quantization.
  • Figure 2: Layer-wise mean Levenshtein distance between all pairs of VQ and GVQ token sequences.
  • Figure 3: Layer-wise UAR [%] for CTID (top) and CLID (bottom) using $k$-NN on token sequences.
  • Figure 4: Best UAR results across layers for CTID and CLID.