Temporal Feature Learning in Weakly Labelled Bioacoustic Cetacean Datasets via a Variational Autoencoder and Temporal Convolutional Network: An Interdisciplinary Approach
Laia Garrobé Fonollosa, Douglas Gillespie, Lina Stankovic, Vladimir Stankovic, Luke Rendell
TL;DR
The paper tackles the challenge of weakly labelled passive acoustic monitoring (PAM) data for cetaceans by introducing a VAE–TCN framework that learns robust temporal features from diverse, multi-source datasets without hand-tuned thresholds. It standardises recordings into 4-minute segments (and a 30-second dataset for higher-resolution features), compares handcrafted features against variational autoencoder (VAE) embeddings, and uses Temporal Convolutional Networks (TCNs) to classify long sequences. The main findings show that 2D-VAE embeddings of spectrograms fed into a TCN yield high recall with low false-positive rates across deployments, achieving AUC above 0.9, and that shorter 30-second annotations can further improve performance for some feature types. This approach enables scalable analysis of large, weakly labelled PAM datasets and holds potential for transfer to other acoustic domains, alongside providing open data and code to promote reproducibility.
Abstract
Bioacoustic data from passive acoustic monitoring (PAM) poses a unique set of challenges for classification, particularly the limited availability of complete and reliable labels due to annotation uncertainty, biological complexity arising from the heterogeneous duration of cetacean vocalizations, and masking of target sounds by environmental and anthropogenic noise. As a result, data is often weakly labelled, with annotations indicating only the presence or absence of a species over several minutes. To capture the complex temporal patterns and key features of lengthy continuous audio segments, we propose an interdisciplinary framework comprising dataset standardisation, feature extraction via Variational Autoencoders (VAE) and classification via Temporal Convolutional Networks (TCN). This approach eliminates the need for manual threshold setting or time-consuming strong labelling. To demonstrate its effectiveness, we use sperm whale (<i>Physeter macrocephalus</i>) click trains in 4-minute recordings as a case study, drawing on a dataset that spans diverse sources and deployment conditions to maximise generalisability. The value of feature extraction via the VAE is demonstrated by comparing classification performance against the traditional, explainable approach of expert hand-picked features. The TCN demonstrated robust classification capabilities, achieving AUC scores exceeding 0.9.
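To illustrate why a TCN suits lengthy continuous segments, the sketch below implements the building block that TCNs are generally built from, a causal dilated 1D convolution, and shows how stacking layers with growing dilation widens the span of past time steps that one output can see. It is a minimal pure-Python illustration, not the architecture used in this work: the kernel size of 3, the doubling dilation schedule, the all-ones weights and the function names are assumptions chosen for clarity.

```python
def causal_dilated_conv1d(x, kernel, dilation):
    """Causal 1D convolution over a single feature channel.

    out[t] depends only on x[t], x[t - dilation], x[t - 2*dilation], ...
    kernel[0] weights the current time step (illustrative convention).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # implicit left zero-padding keeps output length == input length
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input time steps visible to one output step of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical stack: six layers, kernel size 3, dilation doubling per layer.
dilations = [1, 2, 4, 8, 16, 32]

# Push a unit impulse through the stack to observe how far its influence reaches.
x = [0.0] * 200
x[0] = 1.0
h = x
for d in dilations:
    h = causal_dilated_conv1d(h, [1.0, 1.0, 1.0], d)

rf = receptive_field(3, dilations)
print("receptive field (time steps):", rf)
print("time steps influenced by the impulse:", sum(1 for v in h if v != 0.0))
```

Because the receptive field grows with the sum of the dilations, a handful of layers can cover a long sequence of frame-level features, which is the property that makes this family of models attractive when only a segment-level presence/absence label is available.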
