Positive-Unlabelled Active Learning to Curate a Dataset for Orca Resident Interpretation

Bret Nestor; Bohan Yao; Jasmine Moore; Jasper Kanes

TL;DR

This work tackles the challenge of curating a comprehensive, publicly discoverable dataset of Southern Resident Killer Whale (SRKW) acoustics by combining positive-unlabelled active learning with transformer-based detectors. Leveraging over 30 years of archival hydrophone data and multiple public data sources, the authors train compact embeddings from Whisper-tiny and apply a range of labelling and active-learning strategies to identify SRKW, ecotypes, and other marine mammals. The result is DORI, the largest curated dataset to date for SRKW interpretation, accompanied by state-of-the-art detectors and classifiers that outperform existing baselines on several benchmarks, with practical advantages in speed and energy efficiency. The dataset and tools support habitat usage studies, unsupervised translation research, and conservation efforts, while engaging citizen scientists to broaden labelling capacity and accessibility.

Abstract

This work presents the largest curation of Southern Resident Killer Whale (SRKW) acoustic data to date, which also covers other marine mammals in their environment. We systematically search all available public archival hydrophone data within the SRKW habitat (over 30 years of audio data). The search uses a weakly supervised, positive-unlabelled active-learning strategy to identify all instances of marine mammals. The resulting transformer-based detectors outperform state-of-the-art detectors on the DEEPAL, DCLDE-2026, and two newly introduced expert-annotated datasets in terms of accuracy, energy efficiency, and speed. The detection model has a specificity of 0-28.8% at 95% sensitivity. Our multiclass species classifier obtains a top-1 accuracy of 42.1% (11 training classes, 4 test classes) and our ecotype classifier obtains a top-1 accuracy of 43.0% (4 training classes, 5 test classes) on the DCLDE-2026 dataset. The search yields 919 hours of SRKW data, 230 hours of Bigg's orca data, 1374 hours of orca data from unlabelled ecotypes, 1501 hours of humpback data, 88 hours of sea lion data, 246 hours of Pacific white-sided dolphin data, and over 784 hours of unspecified marine mammal data. This SRKW dataset is larger than DCLDE-2026, Ocean Networks Canada, and OrcaSound combined. The curated species labels are available under a CC-BY 4.0 license, and the corresponding audio data are available under the licenses of the original owners. The comprehensive nature of this dataset makes it suitable for unsupervised machine translation, habitat usage surveys, and conservation endeavours for this critically endangered ecotype.
