Self-Supervised Enhancement of Forward-Looking Sonar Images: Bridging Cross-Modal Degradation Gaps through Feature Space Transformation and Multi-Frame Fusion
Zhisheng Zhang, Peng Zhang, Fengxiang Wang, Liangli Ma, Fuchun Sun
TL;DR
This work addresses forward-looking sonar image enhancement, where supervised learning depends on scarce real-world paired data and self-supervised methods borrowed from remote sensing overlook the cross-modal degradation gap between sonar and remote sensing images. It introduces a reference-free pipeline that combines a Deformable Wavelet Scattering Transform (WST) Feature Bridge with a self-supervised multi-frame fusion network, using inter-frame information to suppress speckle noise and raise target brightness. The approach jointly learns adaptive feature representations and frame fusion, supported by downsampling and gradient/consistency losses. It is validated on a self-collected real-world dataset covering targets of three materials, where it shows better quantitative (STD/AG) and visual results than existing methods, with inference taking about 0.3 s for 16 frames. By narrowing the degradation gap, which may also make it feasible to fine-tune remote-sensing pretrained weights, the method is expected to improve robustness and generalization for underwater target detection and to support real-time deployment on autonomous underwater systems.
Abstract
Enhancing forward-looking sonar images is critical for accurate underwater target detection. Current deep learning methods mainly rely on supervised training with simulated data, but the difficulty in obtaining high-quality real-world paired data limits their practical use and generalization. Although self-supervised approaches from remote sensing partially alleviate data shortages, they neglect the cross-modal degradation gap between sonar and remote sensing images. Directly transferring pretrained weights often leads to overly smooth sonar images, detail loss, and insufficient brightness. To address this, we propose a feature-space transformation that maps sonar images from the pixel domain to a robust feature domain, effectively bridging the degradation gap. Additionally, our self-supervised multi-frame fusion strategy leverages complementary inter-frame information to naturally remove speckle noise and enhance target-region brightness. Experiments on three self-collected real-world forward-looking sonar datasets show that our method significantly outperforms existing approaches, effectively suppressing noise, preserving detailed edges, and substantially improving brightness, demonstrating strong potential for underwater target detection applications.
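The intuition behind using complementary inter-frame information can be illustrated with a toy baseline. The sketch below is not the proposed network: it swaps in a plain pixel-wise mean for the learned fusion, and it assumes perfectly registered frames with independent, unit-mean exponential speckle. The scene layout, intensity values, and all function names are hypothetical and chosen only for illustration.

```python
import random
import statistics

BACKGROUND, TARGET = 0.1, 1.0  # hypothetical clean intensities


def make_scene(size=32, lo=12, hi=20):
    """Clean scene: dim background with one bright square target."""
    return [[TARGET if lo <= y < hi and lo <= x < hi else BACKGROUND
             for x in range(size)] for y in range(size)]


def simulate_frames(scene, n_frames, rng):
    """Apply independent unit-mean exponential speckle to each frame (assumed noise model)."""
    return [[[px * rng.expovariate(1.0) for px in row] for row in scene]
            for _ in range(n_frames)]


def fuse_mean(frames):
    """Pixel-wise mean over frames: a naive stand-in for learned fusion."""
    n = len(frames)
    return [[sum(px) / n for px in zip(*rows)] for rows in zip(*frames)]


def background_cv(img, scene):
    """Speckle level: std / mean over pixels that are background in the clean scene."""
    vals = [px for row, ref in zip(img, scene)
            for px, r in zip(row, ref) if r == BACKGROUND]
    return statistics.pstdev(vals) / statistics.mean(vals)


if __name__ == "__main__":
    rng = random.Random(0)
    scene = make_scene()
    frames = simulate_frames(scene, 16, rng)
    fused = fuse_mean(frames)
    print("single-frame speckle CV:", round(background_cv(frames[0], scene), 3))
    print("16-frame mean speckle CV:", round(background_cv(fused, scene), 3))
```

Under this noise model the coefficient of variation of a single frame is 1 in theory, and the mean of N independent frames lowers it to 1/sqrt(N), so roughly 0.25 for 16 frames. Real sonar sequences are neither perfectly registered nor statistically independent, and plain averaging blurs anything that moves between frames, which is one reason a learned fusion strategy may be preferable to this baseline.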
