Exploring bat song syllable representations in self-supervised audio encoders

Marianne de Heer Kloots, Mirjam Knörnschild

TL;DR

The paper investigates whether self-supervised audio encoders trained on non-bat sounds can produce discriminative representations of bat song syllables. By processing Saccopteryx bilineata territorial songs and evaluating embeddings from AVES, HuBERT, and Wav2Vec2 variants, the study quantifies syllable separability via LDA projections and silhouette distances. The main finding is that models pre-trained on human speech yield the most distinct syllable-type subspaces, with animal-vocalization and music-trained models following, suggesting cross-species transfer learning can be effective for bat bioacoustics. The results highlight the influence of pretraining domain and model architecture on cross-species decoding and point to future work involving fine-tuning and interpretability to better identify informative features.
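The silhouette-based separability measure mentioned above can be illustrated with a small sketch. This is the standard silhouette coefficient computed on synthetic 2-D points standing in for LDA-projected embeddings. It is not the paper's code or data, and the function names (`silhouette_score`, `make_clusters`), cluster centers, and spread values are illustrative assumptions.

```python
import math
import random


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    n = len(points)
    clusters = sorted(set(labels))
    scores = []
    for i in range(n):
        # Mean distance from point i to the members of each cluster,
        # excluding point i itself.
        mean_dist = {}
        for c in clusters:
            dists = [math.dist(points[i], points[j])
                     for j in range(n) if labels[j] == c and j != i]
            if dists:
                mean_dist[c] = sum(dists) / len(dists)
        own = labels[i]
        if own not in mean_dist:
            scores.append(0.0)  # singleton cluster: silhouette is taken as 0
            continue
        a = mean_dist[own]                                       # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)     # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


def make_clusters(centers, spread, n_per_type=30, seed=0):
    """Hypothetical toy data: one Gaussian blob per 'syllable type'."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_type):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    return points, labels


# Same spread, different distances between cluster centers.
separated = silhouette_score(*make_clusters([(0, 0), (10, 0), (0, 10)], spread=1.0))
overlapping = silhouette_score(*make_clusters([(0, 0), (1, 0), (0, 1)], spread=1.0))
print(f"well separated: {separated:.2f}, overlapping: {overlapping:.2f}")
```

Scores close to 1 indicate tight, well-separated clusters, while scores near 0 indicate overlapping ones. Under this reading, a model whose syllable types form more distinct subspaces would score higher.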

Abstract

How well can deep learning models trained on human-generated sounds distinguish between another species' vocalization types? We analyze the encoding of bat song syllables in several self-supervised audio encoders, and find that models pre-trained on human speech generate the most distinctive representations of different syllable types. These findings form first steps towards the application of cross-species transfer learning in bat bioacoustics, as well as an improved understanding of out-of-distribution signal processing in audio encoder models.

Paper Structure

This paper contains 8 sections, 2 figures, and 1 table.

Figures (2)

  • Figure 1: Syllable projections along the two most discriminative directions in each LDA-transformed feature space.
  • Figure 2: Separability between syllable type clusters is highest in the self-supervised models trained on human speech (error bars show 95% confidence intervals).