Self-Supervised Enhancement of Forward-Looking Sonar Images: Bridging Cross-Modal Degradation Gaps through Feature Space Transformation and Multi-Frame Fusion

Zhisheng Zhang, Peng Zhang, Fengxiang Wang, Liangli Ma, Fuchun Sun

TL;DR

This work addresses the enhancement of forward-looking sonar imagery, where supervised learning depends on realistic paired data that is scarce, and where self-supervised methods borrowed from remote sensing overlook the degradation gap between the two modalities. It introduces a reference-free pipeline that combines a Deformable Wavelet Scattering Transform (WST) Feature Bridge with a self-supervised multi-frame fusion network, which uses inter-frame information to suppress speckle noise and boost target brightness. The approach jointly learns adaptive feature representations and frame fusion, supervised by downsampling and gradient consistency losses. It is validated on self-collected real-world data covering targets of three materials, where it shows better quantitative (STD/AG) and visual results than existing methods, with inference taking about 0.3 s for 16 frames. By narrowing the degradation gap, the method may also allow remote-sensing pretrained weights to be fine-tuned for sonar, which could improve robustness and generalization for underwater target detection and support real-time deployment on autonomous underwater systems.

Abstract

Enhancing forward-looking sonar images is critical for accurate underwater target detection. Current deep learning methods mainly rely on supervised training with simulated data, but the difficulty in obtaining high-quality real-world paired data limits their practical use and generalization. Although self-supervised approaches from remote sensing partially alleviate data shortages, they neglect the cross-modal degradation gap between sonar and remote sensing images. Directly transferring pretrained weights often leads to overly smooth sonar images, detail loss, and insufficient brightness. To address this, we propose a feature-space transformation that maps sonar images from the pixel domain to a robust feature domain, effectively bridging the degradation gap. Additionally, our self-supervised multi-frame fusion strategy leverages complementary inter-frame information to naturally remove speckle noise and enhance target-region brightness. Experiments on three self-collected real-world forward-looking sonar datasets show that our method significantly outperforms existing approaches, effectively suppressing noise, preserving detailed edges, and substantially improving brightness, demonstrating strong potential for underwater target detection applications.
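The TL;DR reports quantitative results as STD and AG, two metrics that need no ground-truth image. The sketch below shows how they are commonly computed, assuming STD is the standard deviation of pixel intensities and AG is the average gradient under its usual finite-difference definition. The paper's exact formulas are not given in this summary, so the function names and the normalization here are assumptions.

```python
import numpy as np


def std_metric(img):
    """Standard deviation of pixel intensities (assumed meaning of STD)."""
    return float(np.std(np.asarray(img, dtype=np.float64)))


def average_gradient(img):
    """Average gradient (assumed meaning of AG).

    Uses forward differences on the (H-1) x (W-1) interior grid and the
    common normalization sqrt((gx^2 + gy^2) / 2).
    """
    img = np.asarray(img, dtype=np.float64)
    gx = img[:-1, 1:] - img[:-1, :-1]   # horizontal forward difference
    gy = img[1:, :-1] - img[:-1, :-1]   # vertical forward difference
    return float(np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.random((376, 376))      # stand-in for one sonar frame
    print(std_metric(frame), average_gradient(frame))
```

Both functions take a single 2-D grayscale array, so they can be applied to an enhanced output without any paired reference image.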

Paper Structure

This paper contains 15 sections, 15 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: (a) Bridging the Cross-Modal Gap. The degradation gap between multispectral remote sensing (RS) and forward-looking sonar (FLS) images is effectively narrowed by mapping both modalities into a unified feature space via our Deformable WST Feature Bridge; (b) Enhanced Results with Our Method. Compared to the overly smoothed baseline, our method preserves clear target contours and distinct foreground-background boundaries.
  • Figure 2: Overview of the proposed method. The model combines a Deformable WST Feature Bridge—with learnable wavelet scale and orientation offsets—and a multi-frame fusion network into one end-to-end pipeline, replacing the fixed WST extractor plus simple concat-and-$1\times1$ convolution. This unified design closes the cross-modal degradation gap, reduces speckle noise, and boosts target brightness, producing images with crisper contours and richer edge detail.
  • Figure 3: Illustration of the multi-frame fusion network. The network recursively fuses aligned forward-looking sonar feature tensors in a pairwise manner until all $K$ frames are aggregated. A median reference image guides reference-free enhancement through downsampling and gradient consistency losses, enabling effective speckle noise suppression and high-resolution output (from $376\times376$ to $800\times800$).
  • Figure 4: Schematic diagram of the forward-looking sonar acquisition system and representative target images. (a) Custom-built water tank for controlled data collection; (b) C900-II sonar capable of capturing high-frame-rate sequences; (c) 360° rotating guide rail enables accurate pose estimation and multi-view imaging without additional registration; (d) three designed targets with distinct materials: rubber tire, metal torpedo model, and GRP conical frustum.
  • Figure 5: Visual comparison of enhancement performance across three target categories. Each row shows enlarged target regions (torpedo model, tire, and conical frustum) for detailed comparison. Existing methods either amplify noise or over-smooth targets, leading to blurred edges and poor contrast. In contrast, our method enhances brightness, preserves clear contours, and suppresses background noise effectively.
  • ...and 3 more figures
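
The fusion scheme described in the Figure 3 caption can be sketched in NumPy as follows. This is one reading of the caption, not the authors' implementation: `fuse_pair` is a hypothetical stand-in (a plain mean) for the learned pairwise fusion block, the pairwise recursion is assumed to be a level-by-level reduction, nearest-neighbour resizing replaces whatever resampling the paper uses, and both losses are assumed to be L1 distances against the median reference.

```python
import numpy as np


def median_reference(frames):
    """Pixel-wise median over K aligned frames: (K, H, W) -> (H, W)."""
    return np.median(frames, axis=0)


def fuse_pair(a, b):
    """Hypothetical stand-in for the learned pairwise fusion block."""
    return 0.5 * (a + b)


def recursive_pairwise_fusion(tensors):
    """Fuse neighbours pairwise, level by level, until one tensor remains."""
    level = list(tensors)
    if not level:
        raise ValueError("need at least one frame")
    while len(level) > 1:
        nxt = [fuse_pair(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])       # odd frame is carried to the next level
        level = nxt
    return level[0]


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize; handles non-integer ratios such as 800/376."""
    rows = (np.arange(out_h) * img.shape[0] / out_h).astype(int)
    cols = (np.arange(out_w) * img.shape[1] / out_w).astype(int)
    return img[np.ix_(rows, cols)]


def reference_losses(enhanced, reference):
    """Assumed L1 downsampling loss and L1 gradient consistency loss."""
    down = resize_nearest(enhanced, *reference.shape)
    l_down = float(np.mean(np.abs(down - reference)))
    gx_d, gy_d = down[:, 1:] - down[:, :-1], down[1:, :] - down[:-1, :]
    gx_r = reference[:, 1:] - reference[:, :-1]
    gy_r = reference[1:, :] - reference[:-1, :]
    l_grad = float(np.mean(np.abs(gx_d - gx_r)) + np.mean(np.abs(gy_d - gy_r)))
    return l_down, l_grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.random((16, 376, 376))             # K = 16 aligned frames
    fused = recursive_pairwise_fusion(frames)
    enhanced = resize_nearest(fused, 800, 800)      # stand-in for the 800x800 output
    print(reference_losses(enhanced, median_reference(frames)))
```

With the mean used as a placeholder, the recursion simply averages the frames when K is a power of two. In the described method the pairwise block is learned, and the losses tie its high-resolution output back to the median image so that no clean reference is required.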