Structural and Statistical Audio Texture Knowledge Distillation (SSATKD) for Passive Sonar Classification

Jarin Ritu, Amirmohammad Mohammadi, Davelle Carreiro, Alexandra Van Dine, Joshua Peeples

TL;DR

The paper addresses passive sonar target classification, arguing that distilling high-level knowledge alone is insufficient. It introduces SSATKD, a framework that distills two kinds of low-level texture knowledge, structural texture (via edge-aware, multi-scale decomposition) and statistical texture (via RBF-quantized co-occurrences), alongside traditional output distillation, with all loss terms combined through uncertainty weighting. The structural module is built on a Laplacian Pyramid (LP) / Gaussian Pyramid decomposition, and the statistical module uses a 2D Earth Mover’s Distance to align texture distributions. On the DeepShip dataset, the method improves a lightweight student (HLTDNN) guided by strong PANN teachers. Key findings include superior performance when combining structural and distillation losses, the effectiveness of a 4-level LP and 4-level RBF quantization, and favorable comparisons against several contemporary knowledge distillation methods, all while maintaining efficiency suitable for resource-constrained deployments. The framework has practical implications for real-time underwater signal classification and could extend to environmental sound recognition and bioacoustics, with potential gains from self-supervised or multi-modal extensions.
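
To make "RBF-quantized co-occurrences" concrete, the following is a minimal sketch and not the authors' implementation. It assumes evenly spaced quantization centers on a min-max normalized input, a hypothetical sharpness parameter `gamma`, and that "adjacent" means the right-hand neighbor along the time axis; none of these details are stated in this summary. With 4 levels, each position yields one value for every ordered pair of levels, giving 16 co-occurrence maps.

```python
import numpy as np


def rbf_quantize(x, num_levels=4, gamma=10.0):
    """Softly assign each value of x (H, W) to evenly spaced centers.

    Returns an array of shape (num_levels, H, W) whose entries sum to 1
    over the level axis. `gamma` is a hypothetical sharpness parameter.
    """
    x = np.asarray(x, dtype=np.float64)
    # Min-max normalize so the centers cover the full value range.
    span = x.max() - x.min()
    x01 = (x - x.min()) / span if span > 0 else np.zeros_like(x)
    centers = np.linspace(0.0, 1.0, num_levels)
    # RBF response of every value to every center, then normalize.
    resp = np.exp(-gamma * (x01[None, :, :] - centers[:, None, None]) ** 2)
    return resp / resp.sum(axis=0, keepdims=True)


def cooccurrence_maps(q):
    """Pair each position with its right-hand neighbor (an assumption).

    q has shape (N, H, W); the result has shape (N, N, H, W - 1), one
    spatial map per ordered pair of quantization levels.
    """
    left, right = q[:, :, :-1], q[:, :, 1:]
    return left[:, None] * right[None, :]


# Toy input standing in for a spectrogram or feature map.
rng = np.random.default_rng(0)
spec = rng.normal(size=(8, 12))
q = rbf_quantize(spec)
maps = cooccurrence_maps(q)
print(maps.shape)  # (4, 4, 8, 11)
```

Because the soft assignments at each position sum to 1, the 16 maps also sum to 1 at every position, so they can be read as a joint distribution over level pairs. That property is what makes a distribution-matching distance such as Earth Mover’s Distance a natural fit for comparing teacher and student textures.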

Abstract

Knowledge distillation has been successfully applied to various audio tasks, but its potential in underwater passive sonar target classification remains relatively unexplored. Existing methods often focus on high-level contextual information while overlooking essential low-level audio texture features needed to capture local patterns in sonar data. To address this gap, the Structural and Statistical Audio Texture Knowledge Distillation (SSATKD) framework is proposed for passive sonar target classification. SSATKD combines high-level contextual information with low-level audio textures by utilizing an Edge Detection Module for structural texture extraction and a Statistical Knowledge Extractor Module to capture signal variability and distribution. Experimental results confirm that SSATKD improves classification accuracy while optimizing memory and computational resources, making it well-suited for resource-constrained environments.

Paper Structure (25 sections, 14 equations, 5 figures, 9 tables, 2 algorithms)

Figures (5)

  • Figure 1: Visualization of 4 co-occurrence matrices out of the 16 possible matrices, corresponding to a 4-level quantization process. Each matrix captures the pairwise quantization co-occurrence between adjacent spectrogram values in the feature maps. The color intensity represents the frequency of co-occurrence for each pair of quantized levels. Brighter regions (yellow to light green) indicate stronger co-occurrences, while darker regions (dark blue and purple) suggest sparse co-occurrences.
  • Figure 2: The steps in the structural module. $\mathbf{L}_0, \mathbf{L}_1, \mathbf{L}_2, \mathbf{L}_3$ represent the high-pass filtered spectrograms generated by the LP decomposition, while $\mathbf{G}_0, \mathbf{G}_1, \mathbf{G}_2, \mathbf{G}_3$ correspond to the low-pass filtered spectrograms produced by the Gaussian Pyramid (GP). $\mathbf{ED}_0, \mathbf{ED}_1, \mathbf{ED}_2, \mathbf{ED}_3$ denote the edge detection filters applied at each level.
  • Figure 3: Visualization of the 4-level LP decomposition stages. This figure illustrates the downsampling process that generates multi-scale Gaussian levels, with the LP capturing the differences between these levels. The decomposition preserves fine details across different scales, highlighting the transitions from finer to coarser resolutions.
  • Figure 4: Data preparation pipeline for the SSATKD framework. The process includes resampling the original audio signals from the DeepShip dataset to 32 kHz, segmenting the signals into 5-second intervals, and transforming the segments into log Mel-frequency spectrograms. These spectrograms are used as input for the SSATKD framework.
  • Figure 5: Average confusion matrices for the HLTDNN model: (a) baseline HLTDNN and (b) HLTDNN after applying SSATKD. Each cell displays the mean and standard deviation of the predicted class samples. SSATKD raises accuracy from 59.62% to 66.22%, demonstrating the effectiveness of distilling texture-based knowledge.
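
The Gaussian/Laplacian Pyramid decomposition described in the captions for Figures 2 and 3 can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the paper's code: the 5-tap binomial kernel, reflect padding, and nearest-neighbor upsampling are choices made here, and the function names are hypothetical. The sketch keeps one extra coarse Gaussian level so the decomposition can be inverted; how the paper treats the coarsest level is not stated in this summary, and the edge detection filters $\mathbf{ED}_i$ are omitted.

```python
import numpy as np

# 5-tap binomial low-pass kernel (an assumption; any smoothing kernel works).
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0


def blur(img):
    """Separable low-pass filter with reflect padding; keeps the input shape."""
    def smooth(v):
        return np.convolve(np.pad(v, 2, mode="reflect"), KERNEL, mode="valid")
    out = np.apply_along_axis(smooth, 0, img)
    return np.apply_along_axis(smooth, 1, out)


def upsample(img, shape):
    """Nearest-neighbor 2x upsampling, cropped to `shape`, then smoothed."""
    up = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)
    return blur(up[: shape[0], : shape[1]])


def laplacian_pyramid(spec, num_levels=4):
    """Return (gaussians, laplacians) with L_i = G_i - upsample(G_{i+1})."""
    gaussians = [np.asarray(spec, dtype=np.float64)]
    for _ in range(num_levels):
        # Low-pass filter, then drop every other row and column.
        gaussians.append(blur(gaussians[-1])[::2, ::2])
    laplacians = [
        g - upsample(g_next, g.shape)
        for g, g_next in zip(gaussians[:-1], gaussians[1:])
    ]
    return gaussians, laplacians


def reconstruct(coarsest, laplacians):
    """Invert the decomposition from the coarsest Gaussian level."""
    out = coarsest
    for lap in reversed(laplacians):
        out = upsample(out, lap.shape) + lap
    return out


# Toy input standing in for a log Mel spectrogram.
rng = np.random.default_rng(0)
spec = rng.normal(size=(64, 96))
gaussians, laplacians = laplacian_pyramid(spec, num_levels=4)
print([lap.shape for lap in laplacians])
```

Each Laplacian level stores exactly what the next-coarser Gaussian level cannot represent, which is why `reconstruct(gaussians[-1], laplacians)` recovers the input up to floating-point error and why the levels isolate fine-to-coarse detail.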