Multi-Representation Attention Framework for Underwater Bioacoustic Denoising and Recognition

Amine Razig, Youssef Soulaymani, Loubna Benabbou, Pierre Cauchy

TL;DR

The paper addresses robust marine mammal detection in challenging underwater environments by introducing a Mask-Guided Classification (MGC) framework that uses spectrogram segmentation to generate pseudo-attention masks. These masks are fused with spectrogram embeddings through mid-level fusion (including cross-attention) to guide denoising and improve species-specific recognition. Evaluations on data from the Saguenay–St. Lawrence Marine Park (SSLMP) show consistent gains over strong baselines, with high in-distribution accuracy and robust generalization under distributional shifts, while simpler fusion methods can offer stability under heavy out-of-distribution (OOD) perturbations. The approach yields transferable, interpretable representations suitable for real-world, large-scale biodiversity monitoring, with open-source resources to support reproducibility.

Abstract

Automated monitoring of marine mammals in the St. Lawrence Estuary faces extreme challenges: calls span low-frequency moans to ultrasonic clicks, often overlap, and are embedded in variable anthropogenic and environmental noise. We introduce a multi-step, attention-guided framework that first segments spectrograms to generate soft masks of biologically relevant energy and then fuses these masks with the raw inputs for multi-band, denoised classification. Image and mask embeddings are integrated via mid-level fusion, enabling the model to focus on salient spectrogram regions while preserving global context. Using real-world recordings from the Saguenay–St. Lawrence Marine Park Research Station in Canada, we demonstrate that segmentation-driven attention and mid-level fusion improve signal discrimination, reduce false positive detections, and produce reliable representations for operational marine mammal monitoring across diverse environmental conditions and signal-to-noise ratios. Beyond in-distribution evaluation, we further assess the generalization of Mask-Guided Classification (MGC) under distributional shifts by testing on spectrograms generated with alternative acoustic transformations. While high-capacity baseline models lose accuracy in this out-of-distribution (OOD) setting, MGC maintains stable performance, with even simple fusion mechanisms (gated, concat) achieving comparable results across distributions. This robustness highlights the capacity of MGC to learn transferable representations rather than overfitting to a specific transformation, thereby reinforcing its suitability for large-scale, real-world biodiversity monitoring. We show that in all experimental settings, the MGC framework consistently outperforms baseline architectures, yielding substantial gains in accuracy on both in-distribution and OOD data.
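
The abstract names two simple fusion mechanisms, concat and gated, without giving their exact form. The sketch below illustrates one common formulation of each for a pair of embedding vectors. It is not the authors' implementation: the function names, the per-dimension gate, the embedding size, and the random weights are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic function, used here to squash gate pre-activations into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def concat_fusion(img_emb, mask_emb):
    """Concat fusion: join the two embeddings into one longer vector."""
    return list(img_emb) + list(mask_emb)


def gated_fusion(img_emb, mask_emb, weights, bias):
    """Gated fusion (assumed form): a per-dimension gate blends the two embeddings.

    gate_i  = sigmoid(weights[i] . [img_emb; mask_emb] + bias[i])
    fused_i = gate_i * img_emb[i] + (1 - gate_i) * mask_emb[i]
    """
    if len(img_emb) != len(mask_emb):
        raise ValueError("gated fusion needs embeddings of equal length")
    joint = concat_fusion(img_emb, mask_emb)
    fused = []
    for i in range(len(img_emb)):
        pre_activation = sum(w * x for w, x in zip(weights[i], joint)) + bias[i]
        gate = sigmoid(pre_activation)
        fused.append(gate * img_emb[i] + (1.0 - gate) * mask_emb[i])
    return fused


# Hypothetical toy embeddings; a real model would produce these with encoders.
rng = random.Random(0)
dim = 4
img_emb = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
mask_emb = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

# Randomly initialised gate parameters stand in for learned weights.
weights = [[rng.gauss(0.0, 0.1) for _ in range(2 * dim)] for _ in range(dim)]
bias = [0.0] * dim

fused_concat = concat_fusion(img_emb, mask_emb)
fused_gated = gated_fusion(img_emb, mask_emb, weights, bias)
```

In this formulation concat fusion doubles the embedding size and leaves the mixing to later layers, whereas gated fusion keeps the size fixed and produces, for each dimension, a convex combination of the image and mask features.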

Paper Structure

This paper contains 27 sections, 14 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Saguenay–St. Lawrence Marine Park (SSLMP) representation.
  • Figure 2: End-to-end framework for automatic denoising and classification from raw audio.
  • Figure 3: Spectrogram (left), high-quality segmentation mask (middle), and generated binary and real-valued pseudo-attention masks (third and fourth columns) for a recording of porpoise clicks.
  • Figure 4: Representation of the original sample on which the model is trained (left) and an out-of-distribution sample produced by a different signal transformation (right).
  • Figure 5: Comparison of the joint accuracy of the different models on the original data and on the out-of-distribution data. Models from the mask-guided methodology are highlighted in blue.
  • ...and 3 more figures

Theorems & Definitions (1)

  • proof