WhaleNet: a Novel Deep Learning Architecture for Marine Mammals Vocalizations on Watkins Marine Mammal Sound Database
Alessandro Licciardi, Davide Carbone
TL;DR
This work tackles the classification of marine mammal vocalizations in the heterogeneous Watkins Marine Mammal Sound Database (WMMD). It introduces WhaleNet, a deep ensemble architecture that fuses Wavelet Scattering Transform (WST) features with Mel spectrograms through three parallel ResNet branches and an MLP-based merger. The authors report an 8–10 percentage point improvement over prior benchmarks, achieving about 97.6% accuracy on the full WMMD and surpassing 99% with ensemble merging, which underscores the practical value for automated bioacoustic monitoring and conservation. The study also provides a public data-preparation pipeline and highlights the utility of WST for multiscale, natural time-series signals, offering a scalable approach for complex, real-world datasets in marine bioacoustics.
Abstract
The study of marine mammal communication is complex, hindered by the diversity of vocalizations and by environmental factors. The Watkins Marine Mammal Sound Database (WMMD) is a comprehensive labeled dataset used in machine learning applications. However, the data preparation, preprocessing, and classification methodologies documented in the literature vary considerably and are typically not applied to the dataset in its entirety. This study first presents a concise review of the state-of-the-art benchmarks on the dataset, with a particular focus on clarifying data preparation and preprocessing techniques. We then explore the Wavelet Scattering Transform (WST) and the Mel spectrogram as preprocessing mechanisms for feature extraction. We introduce \textbf{WhaleNet} (Wavelet Highly Adaptive Learning Ensemble Network), a deep ensemble architecture for the classification of marine mammal vocalizations that leverages both WST and Mel spectrograms for enhanced feature discrimination. By integrating the insights derived from the WST and Mel representations, we improve classification accuracy by $8$--$10\%$ over existing architectures, reaching a classification accuracy of $97.61\%$.
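To make the Mel spectrogram preprocessing step concrete, the following is a minimal NumPy sketch of a log-Mel spectrogram. It is an illustration under stated assumptions rather than the authors' pipeline: the function names, the sampling rate, and the values of `n_fft`, `hop`, and `n_mels` are all hypothetical choices made for this example, and the WST branch is not shown.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style Hz -> mel mapping (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are equally spaced on the mel scale.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr, n_fft=1024, hop=256, n_mels=64):
    # Frame the signal, apply a Hann window, and take the power spectrum.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[t * hop: t * hop + n_fft] * window
                       for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Project onto the mel filters and convert to decibels.
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return (10.0 * np.log10(mel + 1e-10)).T  # shape: (n_mels, n_frames)

# Usage on a synthetic 1 kHz tone (a stand-in for a real recording).
sr = 8000
t = np.arange(sr) / sr
spec = log_mel_spectrogram(np.sin(2 * np.pi * 1000.0 * t), sr)
print(spec.shape)
```

The resulting two-dimensional array can be treated as a single-channel image, which is the kind of input a ResNet-style branch would typically consume. In a two-representation setup such as the one described above, a scattering transform would produce a second, complementary array from the same signal.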
