animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

Julian C. Schäfer-Zimmermann; Vlad Demartsev; Baptiste Averly; Kiran Dhanjal-Adams; Mathieu Duteil; Gabriella Gall; Marius Faiß; Lily Johnson-Ulrich; Dan Stowell; Marta B. Manser; Marie A. Roch; Ariana Strandburg-Peshkin

TL;DR

This work introduces animal2vec, a self-supervised transformer tailored to sparse bioacoustic data, and MeerKAT, the largest publicly available labeled dataset of non-human terrestrial mammal vocalizations, annotated at millisecond resolution. Using mean-teacher distillation and a domain-specific SincNet frontend, the model is pretrained on unlabeled raw audio and then finetuned with limited labels. It outperforms existing methods on MeerKAT and on the NIPS4Bplus birdsong benchmark, and it retains strong performance in few-shot settings. Robust regularization, event-focused evaluation, and interpretability through attention and spectral analyses make the approach a practical, scalable foundation for bioacoustic analysis and establish MeerKAT as a reference benchmark for future pretraining and finetuning efforts. The work also outlines a vision for a foundational bioacoustic model that can be adapted across species and data modalities with modest labeling requirements.
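
As a rough illustration of the mean-teacher idea mentioned above: in mean-teacher setups generally, the teacher's weights are an exponential moving average (EMA) of the student's weights. The sketch below shows only that update rule. The function name `ema_update`, the dictionary-of-floats weight layout, and the decay value `tau` are assumptions made for illustration; this summary does not state the decay schedule or training targets animal2vec actually uses.

```python
def ema_update(teacher, student, tau):
    """Move each teacher weight toward the matching student weight, in place.

    teacher, student: dicts mapping parameter names to floats (hypothetical layout).
    tau: EMA decay in [0, 1]; values near 1 make the teacher change slowly.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    for name, student_value in student.items():
        # Standard EMA rule: keep a fraction tau of the old teacher weight
        # and blend in a fraction (1 - tau) of the current student weight.
        teacher[name] = tau * teacher[name] + (1.0 - tau) * student_value
    return teacher


# Toy usage with an assumed decay of 0.5 (real setups use values close to 1).
teacher_weights = {"w": 0.0, "b": 1.0}
student_weights = {"w": 1.0, "b": 1.0}
ema_update(teacher_weights, student_weights, tau=0.5)
print(teacher_weights)
```

In such a scheme only the student is updated by gradient descent; the teacher follows it through this averaging step and supplies the training targets.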

Abstract

Bioacoustic research, vital for understanding animal behavior, conservation, and ecology, faces a monumental challenge: analyzing vast datasets where animal vocalizations are rare. While deep learning techniques are becoming standard, adapting them to bioacoustics remains difficult. We address this with animal2vec, an interpretable large transformer model, and a self-supervised training scheme tailored for sparse and unbalanced bioacoustic data. It learns from unlabeled audio and then refines its understanding with labeled data. Furthermore, we introduce and publicly release MeerKAT: Meerkat Kalahari Audio Transcripts, a dataset of meerkat (Suricata suricatta) vocalizations with millisecond-resolution annotations, the largest labeled dataset on non-human terrestrial mammals currently available. Our model outperforms existing methods on MeerKAT and the publicly available NIPS4Bplus birdsong dataset. Moreover, animal2vec performs well even with limited labeled data (few-shot learning). animal2vec and MeerKAT provide a new reference point for bioacoustic research, enabling scientists to analyze large amounts of data even with scarce ground truth information.

Paper Structure (35 sections, 1 equation, 20 figures, 6 tables)

Introduction
Results
I. The MeerKAT bioacoustic dataset
II. The animal2vec framework
Design of animal2vec
Finetuning animal2vec
Evaluating animal2vec
III. Performance of animal2vec on the MeerKAT dataset
IV. Interpreting animal2vec trained models
V. Performance of animal2vec on the NIPS4Bplus benchmark dataset
Discussion
Methods
1: Mean-teacher distillation
2: MeerKAT audio and labels
3: Experimental design for MeerKAT
...and 20 more sections

Figures (20)

Figure 1: The statistics of the MeerKAT dataset and precision-recall curves of the presented classifier. a) shows the temporal distributions of all MeerKAT classes in 12 violin plots. Each category shows kernel density estimates of duration for the class (colored splits on the right). The global distribution across all categories is shown in gray on the left of each plot to make clear how the label durations of each category relate to the dataset overall. All splits are scaled to full width, where the scaling multiplier is shown at the top of each split, as the number of examples for each category varies considerably. In each split, dashed lines show the 25th, 50th, and 75th percentile, where the 50th percentile (median) value is written next to its dashed line. In addition, the event-count, the total duration in minutes, and the percentage with respect to all counts/total duration are displayed at the top of each plot. b) shows four precision-recall curves for (i) the global micro average, and the (ii) close call, (iii) short-note call, and (iv) alarm call class. Results of animal2vec using 1%, 25%, and 100% of the training data are in red, yellow, and teal, respectively, and the baseline results are in gray. Overlays within each subplot show statistics about the occurrence-wise percentage share and the median duration of all events in this class.
Figure 2: Example Mel spectrograms for a representative audio snippet and for each class in dBr scale. (a) a representative stream of audio and (b) the individual classes in MeerKAT. a) shows four alarm call events covered by a varying amount of spectrally broad, ultra-short, and non-stationary noise patterns originating from the meerkats foraging for food by digging in the ground or bumping their collars into obstacles. Noise patterns such as these permeate the majority of MeerKAT. b) shows the spectral variability between classes, where the examples shown do not represent the overall data quality but reflect clean candidates.
Figure 3: animal2vec pretraining schematic.
Figure 4: Globally averaged attention map of a four-second segment showing four move calls. animal2vec operates on pressure waves, but spectrograms are shown here for visualization. Each row shows the importance of the surrounding context for predicting the class associated with an audio frame where dashed lines show the onset/offset of each animal2vec call prediction, which are additionally shown using a blue colormap. An attention map shows the importance of every input frame with respect to every other frame. For predicting, animal2vec attends to the immediate past and future of an event, as well as to a previous instance in the case of the noisy move vocalization.
Figure 5: Mask length distributions of the baseline (solid red line) and our animal2vec model (solid yellow line) during pretraining. For comparison, we also show the distribution of the sample lengths (dashed teal line). Modeled after Figure 2 in the appendix of Baevski et al. (2020).
...and 15 more figures
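
The "global micro average" named in the Figure 1 caption pools true positives, false positives, and false negatives over all classes before computing precision and recall, so frequent classes weigh more than rare ones. The sketch below illustrates that pooling for binary multi-label predictions at a single decision threshold. It counts per label entry, which is an assumption made for simplicity and not the paper's event-focused evaluation protocol; the function name and the toy data are hypothetical.

```python
def micro_precision_recall(y_true, y_pred):
    """Micro-averaged precision and recall for binary multi-label data.

    y_true, y_pred: lists of rows, one row per sample, one 0/1 entry per class.
    Counts are pooled over every (sample, class) entry before dividing.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for truth, pred in zip(true_row, pred_row):
            if pred and truth:
                tp += 1  # predicted and present
            elif pred and not truth:
                fp += 1  # predicted but absent
            elif truth and not pred:
                fn += 1  # present but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example with three samples and three hypothetical classes.
y_true = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
print(micro_precision_recall(y_true, y_pred))  # (0.75, 0.75)
```

Sweeping the decision threshold and recomputing both values at each step would trace one precision-recall curve of the kind shown in panel b).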
