Temporal Feature Learning in Weakly Labelled Bioacoustic Cetacean Datasets via a Variational Autoencoder and Temporal Convolutional Network: An Interdisciplinary Approach

Laia Garrobé Fonollosa, Douglas Gillespie, Lina Stankovic, Vladimir Stankovic, Luke Rendell

TL;DR

The paper tackles the challenge of weakly labelled passive acoustic monitoring (PAM) data for cetaceans by introducing a VAE–TCN framework that learns robust temporal features from diverse, multi-source datasets without hand-tuned thresholds. It standardises recordings into 4-minute segments (and a 30-second dataset for higher-resolution features), compares handcrafted features against variational autoencoder (VAE) embeddings, and uses Temporal Convolutional Networks (TCNs) to classify long sequences. The main findings show that 2D-VAE embeddings of spectrograms fed into a TCN yield high recall with low false-positive rates across deployments, achieving AUC above 0.9, and that shorter 30-second annotations can further improve performance for some feature types. This approach enables scalable analysis of large, weakly labelled PAM datasets and holds potential for transfer to other acoustic domains, alongside providing open data and code to promote reproducibility.

Abstract

Bioacoustic data from passive acoustic monitoring (PAM) poses a unique set of challenges for classification, particularly the limited availability of complete and reliable labels in datasets due to annotation uncertainty, biological complexity due to the heterogeneity in duration of cetacean vocalisations, and masking of target sounds by environmental and anthropogenic noise. This means that data is often weakly labelled, with annotations indicating presence/absence of species over several minutes. To effectively capture the complex temporal patterns and key features of lengthy continuous audio segments, we propose an interdisciplinary framework comprising dataset standardisation, feature extraction via Variational Autoencoders (VAE) and classification via Temporal Convolutional Networks (TCN). This approach eliminates the need for manual threshold setting or time-consuming strong labelling. To demonstrate the effectiveness of our approach, we use sperm whale (Physeter macrocephalus) click trains in 4-minute recordings as a case study, from a dataset comprising diverse sources and deployment conditions to maximise generalisability. The value of feature extraction via the VAE is demonstrated by comparing classification performance against the traditional and explainable approach of expert handpicking of features. The TCN demonstrated robust classification capabilities, achieving AUC scores exceeding 0.9.

Paper Structure

This paper contains 14 sections, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Schematic diagram outlining the methodology employed to develop an acoustic classification framework from diverse datasets of weakly labelled data. The four main steps followed in this study are (1) standardising annotations and recordings from different sources and creating two datasets representative of the variability in anthropogenic and environmental noise, (2) VAE-based feature extraction, (3) detection and classification of sperm whale click trains based on a TCN, and (4) evaluation of the value of VAE-based feature extraction versus handpicked, expert-led features, and of the annotation length of the temporal sequence.
  • Figure 2: Impulsive noise detector algorithm
  • Figure 3: Parametrisation of a sperm whale click waveform and spectrum using the handpicked features described in Table \ref{tab:handcrafted}.
  • Figure 4: Schematic of the TCN architecture (left), a temporal block in the TCN (middle) and a dilated causal convolutional layer with a kernel size (k) of 3 and a dilation (d) of 2 (right). The deep stack of dilated convolutions allows the TCN to capture long-range temporal patterns, making it a suitable architecture to detect the rhythmic vocalisation patterns of sperm whales.
  • Figure 5: Recall versus false positive rate (FPR) on the validation set of the 4-minute dataset for TCNs trained on parameters extracted using VAEs and traditional methods for acoustic feature extraction. The number in parentheses refers to the size of the window over which the feature extraction was performed. The size of the extracted parameters for each of the non-overlapping windows is displayed in text next to each point. The most efficient detectors will be in the lower right quadrant, which in this case are the ones that work on sequences of VAE embeddings of spectrograms.
  • ...and 3 more figures
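
The dilated causal convolution named in the Figure 4 caption (kernel size k = 3, dilation d = 2) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `causal_dilated_conv1d` and `receptive_field` are hypothetical, the filter weights are arbitrary, and the receptive-field formula assumes one convolution per dilation level (a temporal block with two convolutions per level would see further back).

```python
import numpy as np


def causal_dilated_conv1d(x, weights, dilation):
    """Compute y[t] = sum_j weights[j] * x[t - j * dilation].

    The input is zero-padded on the left only, so y[t] never depends on
    samples after time t (the "causal" property).
    """
    x = np.asarray(x, dtype=float)
    k = len(weights)
    pad = (k - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros_like(x)
    for j, w in enumerate(weights):
        # Tap j looks j * dilation steps into the past.
        start = pad - j * dilation
        y += w * padded[start:start + len(x)]
    return y


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked causal convolutions.

    Assumes a single convolution per dilation level (illustrative only).
    """
    return 1 + (kernel_size - 1) * sum(dilations)


# Illustrative input and weights (not taken from the paper).
signal = [1, 2, 3, 4, 5, 6]
out = causal_dilated_conv1d(signal, weights=[1.0, 1.0, 1.0], dilation=2)
print(out)
print(receptive_field(3, [1, 2, 4, 8]))
```

With k = 3 and d = 2, each output sample combines the current input with the inputs 2 and 4 steps earlier. Stacking layers with growing dilations widens the receptive field roughly exponentially with depth, which is how a TCN can span long feature sequences with few layers.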