Advancing Marine Bioacoustics with Deep Generative Models: A Hybrid Augmentation Strategy for Southern Resident Killer Whale Detection
Bruno Padovese, Fabio Frazao, Michael Dowd, Ruth Joy
TL;DR
This study addresses the scarcity of annotated data in marine bioacoustics by evaluating deep generative augmentation (VAEs, GANs, and DDPMs) for Southern Resident Killer Whale (SRKW) call detection. Using data from two hydrophone deployments in the Salish Sea, it compares generative methods against traditional time-shifting and masking. Diffusion-based augmentation provides the strongest gains of any single generative method (F1-score 0.75), while a hybrid approach that combines generative synthesis with time-shifting and masking yields the highest cross-site F1-score (0.81). The results indicate that combining generative and non-generative augmentations best improves generalization to unseen environments, and they highlight diffusion models as a promising tool for conservation-oriented acoustic monitoring. The work also offers practical guidance for deploying robust SRKW detectors in variable acoustic contexts.
Abstract
Automated detection and classification of marine mammal vocalizations is critical for conservation and management efforts but is hindered by limited annotated datasets and the acoustic complexity of real-world marine environments. Data augmentation has proven to be an effective strategy to address this limitation by increasing dataset diversity and improving model generalization without requiring additional field data. However, most augmentation techniques used to date rely on effective but relatively simple transformations, leaving open the question of whether deep generative models can provide additional benefits. In this study, we evaluate the potential of three families of deep generative models for data augmentation in marine mammal call detection: Variational Autoencoders, Generative Adversarial Networks, and Denoising Diffusion Probabilistic Models. Using Southern Resident Killer Whale (Orcinus orca) vocalizations from two long-term hydrophone deployments in the Salish Sea, we compare these approaches against traditional augmentation methods such as time-shifting and vocalization masking. While all generative approaches improved classification performance relative to the baseline, diffusion-based augmentation yielded the highest recall (0.87) and the highest overall F1-score (0.75) among them. A hybrid strategy combining generative synthesis with traditional methods achieved the best overall performance, with an F1-score of 0.81. We hope this study encourages further exploration of deep generative models as complementary augmentation strategies to advance acoustic monitoring of threatened marine mammal populations.
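To make the hybrid idea concrete, the following is a minimal sketch of how traditional and generative augmentations might be pooled into one training set. It is not the authors' implementation: the function names (`time_shift`, `time_mask`, `hybrid_training_set`), the parameters `max_shift` and `max_width`, and the spectrogram layout (frequency by time) are all assumptions made for illustration. The generic time mask used here may differ from the vocalization masking evaluated in the study, and the `synthetic` list is a stand-in for spectrograms sampled from a trained generative model such as a DDPM.

```python
import numpy as np


def time_shift(spec, max_shift, rng):
    """Circularly shift a (freq, time) spectrogram along the time axis."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(spec, shift, axis=1)


def time_mask(spec, max_width, rng):
    """Overwrite a random contiguous block of time frames with the minimum value.

    Assumption: a generic time mask, used here as a simple proxy for masking.
    """
    out = spec.copy()
    n_frames = out.shape[1]
    width = min(int(rng.integers(1, max_width + 1)), n_frames)
    start = int(rng.integers(0, n_frames - width + 1))
    out[:, start:start + width] = spec.min()
    return out


def hybrid_training_set(real, synthetic, rng, max_shift=8, max_width=4):
    """Pool real examples, synthetic examples, and one shifted and one
    masked copy of each real example (hypothetical mixing scheme)."""
    shifted = [time_shift(s, max_shift, rng) for s in real]
    masked = [time_mask(s, max_width, rng) for s in real]
    return list(real) + list(synthetic) + shifted + masked


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder arrays standing in for real and model-generated spectrograms.
    real = [rng.random((64, 128)) for _ in range(4)]
    synthetic = [rng.random((64, 128)) for _ in range(4)]
    train = hybrid_training_set(real, synthetic, rng)
    print(len(train), train[0].shape)
```

In this sketch each real example contributes one shifted and one masked copy. The actual ratio of real, synthetic, and traditionally augmented samples is a design choice that the abstract does not specify.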
