Advancing Marine Bioacoustics with Deep Generative Models: A Hybrid Augmentation Strategy for Southern Resident Killer Whale Detection

Bruno Padovese, Fabio Frazao, Michael Dowd, Ruth Joy

TL;DR

This study addresses the scarcity of annotated data in marine bioacoustics by evaluating deep generative augmentation (VAEs, GANs, and DDPMs) for Southern Resident Killer Whale (SRKW) call detection. Using two Salish Sea datasets, it compares traditional time-shifting and masking with generative methods and shows that diffusion-based augmentation provides the largest gains of any single generative method, while a hybrid approach that combines generative synthesis with time-shifting and masking yields the highest cross-site F1 score. The results indicate that combining generative and non-generative augmentations best improves generalization to unseen environments, highlighting diffusion models as a promising tool for conservation-oriented acoustic monitoring. Overall, the work advances data augmentation practices in marine bioacoustics and offers practical guidance for deploying robust SRKW detectors in variable acoustic contexts.

Abstract

Automated detection and classification of marine mammal vocalizations is critical for conservation and management efforts but is hindered by limited annotated datasets and the acoustic complexity of real-world marine environments. Data augmentation has proven to be an effective strategy to address this limitation by increasing dataset diversity and improving model generalization without requiring additional field data. However, most augmentation techniques used to date rely on effective but relatively simple transformations, leaving open the question of whether deep generative models can provide additional benefits. In this study, we evaluate the potential of three deep generative models for data augmentation in marine mammal call detection: Variational Autoencoders, Generative Adversarial Networks, and Denoising Diffusion Probabilistic Models. Using Southern Resident Killer Whale (Orcinus orca) vocalizations from two long-term hydrophone deployments in the Salish Sea, we compare these approaches against traditional augmentation methods such as time-shifting and vocalization masking. While all generative approaches improved classification performance relative to the baseline, diffusion-based augmentation yielded the highest recall (0.87) and overall F1-score (0.75) among them. A hybrid strategy combining generative synthesis with traditional methods achieved the best overall performance, with an F1-score of 0.81. We hope this study encourages further exploration of deep generative models as complementary augmentation strategies to advance acoustic monitoring of threatened marine mammal populations.

Paper Structure

This paper contains 21 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Locations of the two hydrophone deployments in the Salish Sea (Lime Kiln and Roberts Bank). The commercial shipping lanes are shown as black lines.
  • Figure 2: Overview of the vocalization mask construction process. Starting from the original spectrogram (left), we first apply PCA-based background subtraction to emphasize vocalization components (center). A subsequent percentile-based thresholding step produces a sparse, high-contrast mask (right) that preserves the primary vocal features while suppressing residual background noise.
  • Figure 3: Illustration of a Denoising Diffusion Probabilistic Model (DDPM) for synthesizing spectrograms. The forward diffusion process (solid arrow) incrementally corrupts a clean input spectrogram $x_0$ into pure noise $x_T$ over $T$ timesteps. The reverse diffusion process (dashed arrow) learns to denoise by training a U-Net to predict and remove noise at each step.
  • Figure 4: Examples of synthetic spectrograms generated using DDPMs. The top row shows samples that were accepted for training, exhibiting clear SRKW-like vocal structure. The bottom row shows samples that are unsuitable due to the presence of artifacts or poor signal definition.
  • Figure 5: Examples of (a) real SRKW vocalizations, (b) VAE-generated samples, (c) GAN-generated samples, and (d) DDPM-generated samples.
  • ...and 1 more figures
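
The captions of Figures 2 and 3 describe two procedures that can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the use of a rank-k SVD reconstruction as the PCA background estimate, the 90th-percentile threshold, and the linear noise schedule are all assumptions made for illustration, and the spectrogram is synthetic. The forward step uses the standard DDPM closed form $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$.

```python
import numpy as np


def vocalization_mask(spec, n_components=1, percentile=90.0):
    """Sketch of a Figure 2 style mask (hypothetical helper).

    spec: 2-D array of shape (freq_bins, time_frames).
    Assumption: the leading principal components mostly capture
    background noise, so their reconstruction is subtracted.
    """
    spec = np.asarray(spec, dtype=float)
    # Center each frequency bin over time, then take an SVD (PCA).
    row_mean = spec.mean(axis=1, keepdims=True)
    u, s, vt = np.linalg.svd(spec - row_mean, full_matrices=False)
    k = n_components
    background = row_mean + (u[:, :k] * s[:k]) @ vt[:k]
    # Keep only energy above the background estimate.
    residual = np.clip(spec - background, 0.0, None)
    # Percentile threshold yields a sparse, high-contrast mask.
    threshold = np.percentile(residual, percentile)
    return np.where(residual > threshold, residual, 0.0)


def forward_diffuse(x0, t, betas, rng):
    """Sample x_t from q(x_t | x_0) in closed form (Figure 3, solid arrow).

    t is a 0-based index into the schedule `betas`.
    Returns the noised input and the noise that a denoiser would predict.
    """
    alpha_bar = np.cumprod(1.0 - betas)
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise


# Synthetic example: time-varying broadband background plus one "call".
rng = np.random.default_rng(0)
n_freq, n_time = 64, 128
profile = np.linspace(3.0, 0.6, n_freq)[:, None]
gain = 1.0 + 0.5 * np.sin(np.linspace(0.0, 4.0 * np.pi, n_time))[None, :]
spec = profile * gain + 0.1 * rng.standard_normal((n_freq, n_time))
spec[20:24, 50:70] += 2.0  # the synthetic vocalization
mask = vocalization_mask(spec)

# Assumed linear schedule with T = 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)
x_T, _ = forward_diffuse(spec, len(betas) - 1, betas, rng)
```

In this toy setting the background varies more strongly than the call, so the first component tracks the background. A loud call that dominates the variance could be absorbed into the background estimate instead, which is one reason the component count and threshold would need tuning on real recordings.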