Positive-Unlabelled Active Learning to Curate a Dataset for Orca Resident Interpretation

Bret Nestor, Bohan Yao, Jasmine Moore, Jasper Kanes

TL;DR

This work tackles the challenge of curating a comprehensive, publicly discoverable dataset of Southern Resident Killer Whale acoustics by combining positive-unlabelled active learning with transformer-based detectors. Leveraging over 30 years of archival hydrophone data and multiple public data sources, the authors train compact embeddings from Whisper-tiny and apply a range of labeling and active-learning strategies to identify SRKW, ecotypes, and other marine mammals. The result is DORI, the largest curated dataset to date for SRKW interpretation, accompanied by state-of-the-art detectors and classifiers that outperform existing baselines on several benchmarks, with practical advantages in speed and energy efficiency. The dataset and tools support habitat usage studies, unsupervised translation research, and conservation efforts, while engaging citizen scientists to broaden labeling capacity and accessibility.

Abstract

This work presents the largest curation of Southern Resident Killer Whale (SRKW) acoustic data to date, which also contains other marine mammals in their environment. We systematically search all available public archival hydrophone data within the SRKW habitat (over 30 years of audio data). The search uses a weakly-supervised, positive-unlabelled active learning strategy to identify all instances of marine mammals. The resulting transformer-based detectors outperform state-of-the-art detectors on the DEEPAL, DCLDE-2026, and two newly introduced expert-annotated datasets in terms of accuracy, energy efficiency, and speed. The detection model has a specificity of 0-28.8% at 95% sensitivity. Our multiclass species classifier obtains a top-1 accuracy of 42.1% (11 train classes, 4 test classes) and our ecotype classifier obtains a top-1 accuracy of 43.0% (4 train classes, 5 test classes) on the DCLDE-2026 dataset. The search yields 919 hours of SRKW data, 230 hours of Bigg's orca data, 1374 hours of orca data from unlabelled ecotypes, 1501 hours of humpback data, 88 hours of sea lion data, 246 hours of Pacific white-sided dolphin data, and over 784 hours of unspecified marine mammal data. This SRKW dataset is larger than DCLDE-2026, Ocean Networks Canada, and OrcaSound combined. The curated species labels are available under a CC-BY 4.0 license, and the corresponding audio data are available under the licenses of the original owners. The comprehensive nature of this dataset makes it suitable for unsupervised machine translation, habitat usage surveys, and conservation endeavours for this critically endangered ecotype.
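
The abstract reports detector performance as specificity at 95% sensitivity. As a minimal sketch of how such an operating point can be computed from detector scores, the function below picks the highest threshold that still recovers the target fraction of positives and reports the share of negatives rejected at that threshold. The function name, the score convention (higher means more likely a marine mammal), and the toy values are assumptions for illustration, not the authors' evaluation code.

```python
import numpy as np


def specificity_at_sensitivity(scores, labels, target_sensitivity=0.95):
    """Return (threshold, specificity) at a fixed sensitivity.

    A sample is predicted positive when its score is >= threshold.
    Hypothetical helper: not taken from the paper.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = np.sort(scores[labels])   # positive scores, ascending
    neg = scores[~labels]           # negative scores
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative sample")
    # Number of positives that must score at or above the threshold.
    k = int(np.ceil(target_sensitivity * pos.size))
    # The k-th highest positive score is the largest usable threshold.
    threshold = pos[pos.size - k]
    # Specificity: fraction of negatives correctly scored below the threshold.
    specificity = float(np.mean(neg < threshold))
    return float(threshold), specificity


# Toy scores only; these are not results from the paper.
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.75, 0.6]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
thr, spec = specificity_at_sensitivity(toy_scores, toy_labels, 0.75)
```

A specificity near zero at this operating point means that keeping 95% of the positives requires a threshold low enough to admit almost all negatives, which is why the reported range is wide.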

Paper Structure

This paper contains 27 sections, 2 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Hydrophone deployment sites from which data are collected.
  • Figure 2: Active learning without noisy labels. Mixing positive and active learning helps discover samples with no penalty compared to entropy-only sampling.
  • Figure 3: Active learning with label noise. At each iteration, 30% of positively labelled samples are flipped to negatives. In this setting, entropy-only sampling slightly outperforms the mixed strategies.
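
The captions for Figures 2 and 3 compare entropy-only sampling against strategies that mix in positive sampling, with and without label noise. The sketch below illustrates one plausible reading of that setup: part of each query batch goes to the samples the model scores as most likely positive, the rest to the samples with the highest predictive entropy, and a separate helper flips a fraction of positive labels to negatives. The function names, the `positive_fraction` parameter, and the toy probabilities are assumptions made for illustration. The exact mixing rule used in the paper is not given in this section.

```python
import numpy as np


def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli prediction with probability p."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def select_batch(probs, batch_size, positive_fraction=0.5):
    """Pick indices to label: most-confident positives, then highest entropy.

    positive_fraction=0 reduces to entropy-only sampling.
    Hypothetical mixing rule, assumed for this sketch.
    """
    probs = np.asarray(probs, dtype=float)
    batch_size = min(batch_size, probs.size)
    n_pos = int(round(positive_fraction * batch_size))
    # Samples the model currently believes are most likely positive.
    chosen = np.argsort(-probs, kind="stable")[:n_pos].tolist()
    taken = set(chosen)
    # Fill the remainder with the most uncertain samples not yet chosen.
    for idx in np.argsort(-binary_entropy(probs), kind="stable").tolist():
        if len(chosen) == batch_size:
            break
        if idx not in taken:
            chosen.append(idx)
            taken.add(idx)
    return chosen


def flip_positive_labels(labels, rate=0.3, rng=None):
    """Return a copy of labels with a fraction of positives set to negative."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.asarray(labels, dtype=bool).copy()
    pos_idx = np.flatnonzero(noisy)
    n_flip = int(round(rate * pos_idx.size))
    noisy[rng.choice(pos_idx, size=n_flip, replace=False)] = False
    return noisy


# Toy model outputs for six unlabelled clips; not data from the paper.
toy_probs = [0.99, 0.5, 0.1, 0.95, 0.45, 0.02]
mixed = select_batch(toy_probs, batch_size=4, positive_fraction=0.5)
entropy_only = select_batch(toy_probs, batch_size=3, positive_fraction=0.0)
noisy = flip_positive_labels([1] * 10 + [0] * 5, rate=0.3,
                             rng=np.random.default_rng(0))
```

In the toy example the mixed batch contains the two highest-scoring clips plus the two most ambiguous ones, while the entropy-only batch contains only clips near the decision boundary.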