Towards Leveraging Sequential Structure in Animal Vocalizations
Eklavya Sarkar, Mathew Magimai.-Doss
TL;DR
This work addresses the challenge of preserving temporal structure in animal vocalizations by converting HuBERT-derived embeddings into discrete per-frame tokens via vector quantization (VQ) and Gumbel-Softmax vector quantization (GVQ). The authors evaluate whether the token sequences retain informative sequential information using Levenshtein distance-based analyses and k-NN classification across four bioacoustic datasets, comparing against a pooling-based linear baseline. Findings show that VQ tokens reflect call-type and caller information to some extent, with stronger discrimination for call-types, while GVQ is unstable (notably suffering codebook collapse) on several datasets. Overall, a single codebook falls short of the linear baseline, signaling the need for more sophisticated sequence modeling and multi-codebook approaches. The work highlights the potential and limitations of discrete audio tokens for bioacoustic sequence analysis and outlines concrete avenues for improving token-based representations and downstream tasks in animal communication research.
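To make the tokenization step concrete, the following is a minimal sketch of assigning each frame embedding to its nearest codebook entry. It assumes squared Euclidean distance and a fixed, hand-written codebook; the function name `quantize` and the toy values are hypothetical, and the summary above does not specify how the codebook is learned or which distance is used.

```python
from typing import List, Sequence


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> List[int]:
    """Map each frame embedding to the index of its nearest codebook vector.

    Assumption: nearest is measured by squared Euclidean distance.
    """
    tokens = []
    for frame in frames:
        # Distance from this frame to every codebook entry.
        dists = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        # The token is the index of the closest entry.
        tokens.append(dists.index(min(dists)))
    return tokens


# Toy example: three 2-D "frames" and a hypothetical 2-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.1]]
tokens = quantize(frames, codebook)
```

Each vocalization thus becomes a sequence of integers with one token per frame, so the order of sub-units is kept rather than averaged away.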
Abstract
Animal vocalizations contain sequential structures that carry important communicative information, yet most computational bioacoustics studies average the extracted frame-level features across the temporal axis, discarding the order of the sub-units within a vocalization. This paper investigates whether discrete acoustic token sequences, derived through vector quantization and Gumbel-Softmax vector quantization of extracted self-supervised speech model representations, can effectively capture and leverage temporal information. To that end, pairwise distance analysis of token sequences generated from HuBERT embeddings shows that they can discriminate call-types and callers across four bioacoustics datasets. Sequence classification experiments using $k$-Nearest Neighbour with Levenshtein distance show that the vector-quantized token sequences yield reasonable call-type and caller classification performances, and hold promise as alternative feature representations towards leveraging sequential information in animal vocalizations.
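As an illustration of the classification setup described above, here is a minimal sketch of $k$-Nearest Neighbour over token sequences with Levenshtein distance. It assumes unit costs for insertion, deletion and substitution, and a plain majority vote; the function names, the toy token sequences and the labels are hypothetical and are not taken from the paper's data or code.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def levenshtein(a: Sequence[int], b: Sequence[int]) -> int:
    """Edit distance between two token sequences (unit costs assumed)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free on a match
            ))
        prev = curr
    return prev[-1]


def knn_predict(
    query: Sequence[int],
    train: List[Tuple[Sequence[int], str]],
    k: int = 3,
) -> str:
    """Majority label among the k training sequences closest to query."""
    neighbours = sorted(
        train, key=lambda item: levenshtein(query, item[0])
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical token sequences with made-up call-type labels.
train = [
    ([1, 1, 2, 2], "type_a"),
    ([1, 1, 2], "type_a"),
    ([5, 5, 6], "type_b"),
    ([5, 6, 6, 6], "type_b"),
]
prediction = knn_predict([1, 2, 2], train, k=3)
```

Because the distance is computed on ordered sequences, two vocalizations with the same tokens in a different order are treated as different, which is the property that temporal averaging discards.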
