WhaleNet: a Novel Deep Learning Architecture for Marine Mammals Vocalizations on Watkins Marine Mammal Sound Database

Alessandro Licciardi; Davide Carbone

TL;DR

This work tackles the challenge of classifying marine mammal vocalizations in the heterogeneous Watkins Marine Mammal Sound Database (WMMD). It introduces WhaleNet, a deep ensemble architecture that fuses Wavelet Scattering Transform (WST) features with Mel spectrograms through three parallel ResNet branches and an MLP-based merger. The authors report an 8–10 percentage point improvement over prior benchmarks, achieving about 97.6% accuracy on the full WMMD and surpassing 99% with ensemble merging, which underscores the practical value for automated bioacoustic monitoring and conservation. The study also provides a public data-preparation pipeline and highlights the utility of WST for multiscale, natural time-series signals, offering a scalable approach for complex, real-world datasets in marine bioacoustics.

Abstract

Marine mammal communication is a complex field of study, hindered by the diversity of vocalizations and by environmental factors. The Watkins Marine Mammal Sound Database (WMMD) is a comprehensive labeled dataset used in machine learning applications. Nevertheless, the data preparation, preprocessing, and classification methodologies documented in the literature vary considerably and are typically not applied to the dataset in its entirety. This study first gives a concise review of the state-of-the-art benchmarks on the dataset, with a particular focus on clarifying data preparation and preprocessing techniques. We then explore the Wavelet Scattering Transform (WST) and the Mel spectrogram as preprocessing techniques for feature extraction. We introduce \textbf{WhaleNet} (Wavelet Highly Adaptive Learning Ensemble Network), a deep ensemble architecture for the classification of marine mammal vocalizations that leverages both WST and Mel spectrograms for enhanced feature discrimination. By integrating the insights derived from the WST and Mel representations, we improve classification accuracy by $8-10\%$ over existing architectures, reaching a classification accuracy of $97.61\%$.
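The ensemble idea described above (several branches whose outputs are merged by an MLP) can be illustrated with a minimal late-fusion sketch. This is not the authors' implementation: the choice to feed per-branch class probabilities into the merger, the hidden width `HIDDEN`, the random untrained weights, and the helper names `init_layer`, `linear`, and `merge` are all assumptions made for illustration. Only the number of branches (three) and the number of classes (32) come from the text.

```python
import math
import random

NUM_CLASSES = 32   # number of classes, as stated in the Figure 5 caption
NUM_BRANCHES = 3   # three parallel branches, as stated in the TL;DR
HIDDEN = 16        # hypothetical hidden width, NOT taken from the paper


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Random (untrained) weights, used only to make the sketch runnable."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def merge(branch_probs, params):
    """Concatenate branch outputs, then apply a one-hidden-layer MLP."""
    x = [p for probs in branch_probs for p in probs]  # length 3 * 32 = 96
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, v) for v in linear(x, w1, b1)]  # ReLU
    return softmax(linear(hidden, w2, b2))


rng = random.Random(0)
params = [init_layer(NUM_BRANCHES * NUM_CLASSES, HIDDEN, rng),
          init_layer(HIDDEN, NUM_CLASSES, rng)]

# Stand-ins for the three branch outputs on a single input sample.
branch_probs = [softmax([rng.gauss(0.0, 1.0) for _ in range(NUM_CLASSES)])
                for _ in range(NUM_BRANCHES)]

merged = merge(branch_probs, params)
predicted_class = max(range(NUM_CLASSES), key=lambda k: merged[k])
```

In a trained system the merger weights would be fitted on held-out branch outputs; here they are random, so `predicted_class` carries no meaning beyond showing the data flow.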

Paper Structure (17 sections, 15 equations, 7 figures, 3 tables, 3 algorithms)

Introduction
Preprocessing techniques
STFT and Mel Spectrogram
Wavelet Scattering Transform
Training and test datasets set-up
Watkins Marine Mammal Sound Database
Data processing
Training and Test Datasets
Software and Computational Resources
Model Architecture Design
Residual Learning
WhaleNet Architecture
Hyper-parameters
Metrics for Performance Evaluation
Results and Discussions
...and 2 more sections

Figures (7)

Figure 1: From left: Mel spectrogram and WST of first and second order for vocalizations of two different species of whales. The displayed WSTs correspond to the choice $(J,Q)=(7,10)$. In the second row, the correspondence between high-depth scales in the WST and low frequencies in the spectrogram is graphically evident. The Mel spectrogram appears more coarse-grained than the first-order WST, even though the overall heatmaps look similar. Each image is resized to a square for visualization purposes. In each row the image shapes are, from left, 41$\times$64 for the Mel spectrogram, and 53$\times$63 and 158$\times$63 for the first- and second-order WST.
Figure 2: Wavelet Scattering Transform as an iterative process; image taken from anden2014deep. In their notation, the signal is $x(t)=h(t)$ and the path $p$ at depth $m$ is written explicitly in parentheses as a tuple $(\lambda_1,\dots,\lambda_m)$. Each black dot corresponds to a scattering coefficient.
Figure 3: Number of samples per class after data preparation and removal of duplicates, in log scale and sorted in decreasing order. The dataset is highly imbalanced: the most represented class contains $2637$ instances, while the smallest contains just $15$.
Figure 4: Structure of the residual block used in the full architecture (Figure 5). The acronym "BN" stands for batch normalization.
Figure 5: Structure of the architecture employed in the classification task. The residual blocks are unfolded in Figure 4. The acronym "BN" stands for batch normalization, while "FC 64" denotes a fully connected layer with input dimension $64$ and output dimension $32$, corresponding to the number of classes. The number of trainable parameters is $176400$. For comparison, AlexNet krizhevsky2017imagenet, which is employed in lu2021detection for a different classification task on WMMD, has 62.3 million parameters.
...and 2 more figures
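The figures quoted in the captions of Figure 3 and Figure 5 imply a few ratios that are easy to check by hand. The short sketch below does that arithmetic; it uses only numbers stated in those captions, and the one assumption is that the final fully connected layer carries a bias term (the captions do not say).

```python
# Class imbalance, from the Figure 3 caption.
largest_class = 2637
smallest_class = 15
imbalance_ratio = largest_class / smallest_class  # 175.8

# Final fully connected layer, from the Figure 5 caption (64 inputs, 32 classes).
fc_in, fc_out = 64, 32
fc_params = fc_in * fc_out + fc_out  # 2080, ASSUMING a bias vector is present

# Model size comparison, from the Figure 5 caption.
whalenet_params = 176_400
alexnet_params = 62_300_000
size_ratio = alexnet_params / whalenet_params  # roughly 353

print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
print(f"FC layer parameters (with bias): {fc_params}")
print(f"AlexNet / WhaleNet parameter ratio: {size_ratio:.0f}")
```

Under these numbers the classifier head accounts for only about 1% of the trainable parameters, and AlexNet is roughly 350 times larger than the architecture in Figure 5.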

Theorems & Definitions (2)

Definition 2.1
Definition 2.2
