Investigation of Time-Frequency Feature Combinations with Histogram Layer Time Delay Neural Networks

Amirmohammad Mohammadi, Irené Masabarakiza, Ethan Barnes, Davelle Carreiro, Alexandra Van Dine, Joshua Peeples

TL;DR

This work tackles underwater acoustic target recognition (UATR) by evaluating how combinations of time-frequency features influence the performance of a histogram layer time delay neural network (HLTDNN). It introduces adaptive padding-based fusion of six spectrogram features, yielding 63 possible combinations, and shows that the best mix (VQT, MFCC, STFT, GFCC) reaches 66.17% accuracy, compared with 59.34% for MFCC alone. Explainability analyses via confusion matrices, t-SNE, and FullCAM indicate that feature fusion improves class separability and sharpens the model's focus on specific frequency regions. The findings highlight the value of multi-spectrogram representations for robust UATR and point to avenues for end-to-end learning and more efficient feature selection.
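The count of 63 follows from taking every non-empty subset of six features (2^6 - 1 = 63). The short sketch below enumerates them. Only four feature names appear in this summary, so `feature_5` and `feature_6` are placeholders rather than the paper's actual names, and `nonempty_subsets` is a hypothetical helper written for illustration.

```python
from itertools import combinations

# Four names are taken from the summary; the last two are placeholders
# (assumptions) standing in for the remaining, unnamed features.
FEATURES = ["VQT", "MFCC", "STFT", "GFCC", "feature_5", "feature_6"]


def nonempty_subsets(items):
    """Return every non-empty subset of items as a tuple, smallest subsets first."""
    return [
        subset
        for size in range(1, len(items) + 1)
        for subset in combinations(items, size)
    ]


subsets = nonempty_subsets(FEATURES)
print(len(subsets))  # 2**6 - 1 = 63
```

An exhaustive sweep like this is feasible for six features but doubles with each added feature, which is one reason the summary points to more efficient feature selection as future work.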

Abstract

While deep learning has reduced the prevalence of manual feature extraction, transformation of data via feature engineering remains essential for improving model performance, particularly for underwater acoustic signals. The methods by which audio signals are converted into time-frequency representations and the subsequent handling of these spectrograms can significantly impact performance. This work demonstrates the performance impact of using different combinations of time-frequency features in a histogram layer time delay neural network. An optimal set of features is identified with results indicating that specific feature combinations outperform single data features.

Paper Structure

This paper contains 7 sections, 1 equation, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Overall workflow. The audio signals are first partitioned into individual segments. Each audio segment is then used to compute individual features with varying time and frequency resolutions. We introduce an adaptive padding layer to make each spectrogram the same size. The features are then concatenated along the channel dimension and processed by the HLTDNN model for classification.
  • Figure 2: Comparison of confusion matrix results for (a) best combination of features (66.17 $\pm$ 1.10%) and (b) MFCC alone (59.34 $\pm$ 3.16%).
  • Figure 3: t-SNE results for (a) best feature combination (FDR: 61.26 $\pm$ 4.52) and (b) MFCC (FDR: 29.04 $\pm$ 2.49). Colors represent four classes of ships. The average log Fisher Discriminant Ratio (FDR) ($\pm 1\sigma$) is also shown, with higher scores indicating more compact and better-separated classes. The best random seed is used for each feature.
  • Figure 4: Class Activation Maps (CAM) using the best feature combination and MFCC alone for the Cargo class. The third row compares the CAM overlay on the MFCC for both models. (a) Original VQT, (b) Original STFT, (c) Original GFCC, (d) Best CAM for VQT, (e) Best CAM for STFT, (f) Best CAM for GFCC, (g) Original MFCC, (h) Best CAM, (i) MFCC CAM.
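The padding-and-concatenation step described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. The summary does not specify how the adaptive padding layer works, so this assumes symmetric zero padding up to the largest frequency and time dimensions in the set. The function names and the spectrogram shapes are hypothetical.

```python
import numpy as np


def pad_to_shape(spec, target_shape):
    """Zero-pad a 2-D (frequency, time) array symmetrically up to target_shape."""
    pad_f = target_shape[0] - spec.shape[0]
    pad_t = target_shape[1] - spec.shape[1]
    # Split the padding as evenly as possible; any odd remainder goes at the end.
    pad_width = (
        (pad_f // 2, pad_f - pad_f // 2),
        (pad_t // 2, pad_t - pad_t // 2),
    )
    return np.pad(spec, pad_width, mode="constant", constant_values=0.0)


def fuse_features(specs):
    """Pad each spectrogram to a common size, then stack along a new channel axis."""
    target = (
        max(s.shape[0] for s in specs),
        max(s.shape[1] for s in specs),
    )
    return np.stack([pad_to_shape(s, target) for s in specs], axis=0)


# Hypothetical feature maps with differing frequency and time resolutions.
specs = [np.ones((64, 100)), np.ones((16, 48)), np.ones((48, 101))]
fused = fuse_features(specs)
print(fused.shape)  # (3, 64, 101): (channels, frequency, time)
```

Padding keeps each feature's original values intact, whereas resizing by interpolation would alter them. Whether the paper's layer pads symmetrically or to one side is not stated here.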