Physically Motivated Knowledge Distillation for Blind Geometric Correction of Side-Scan Sonar Imagery

Can Lei, Hayat Rajani, Valerio Franchi, Rafael Garcia, Nuno Gracias, Huigang Wang, Wei Qiang

Abstract

Side-scan sonar (SSS) imagery is susceptible to geometric distortions caused by platform motion instability, which degrade geometric consistency and limit downstream analyses such as mosaicking and perception. Conventional correction methods typically rely on navigation and attitude measurements, which are often unreliable in real ocean conditions. This unreliability necessitates blind geometric correction from a single distorted image, a highly ill-posed problem. To address this issue, we propose a physically motivated knowledge distillation framework for blind geometric correction of SSS imagery. Specifically, a teacher network is trained using paired distorted and geocoded reference images to learn distortion-related geometric differences, and this knowledge is transferred to a student network that performs correction using only a single distorted image during blind inference. To ensure physically plausible deformation estimation, we design a parametric decoder that represents distortions as row-wise affine transformations consistent with the SSS line-scanning imaging mechanism. To compensate for the absence of reference information during blind inference, a hallucination context module is introduced to approximate the teacher's geometric reasoning from distorted features under a multi-level distillation scheme. In addition, a differentiable forward warping strategy is adopted to handle the non-bijective deformation characteristics of SSS imagery in an end-to-end manner. Extensive experiments on multiple datasets show that the proposed method outperforms state-of-the-art baselines and generalizes well across different platforms and acquisition conditions.
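
As a rough illustration of what "row-wise affine transformations" can mean in practice, the sketch below expands one set of affine parameters per scan line into a dense flow field. The parameterization (a per-row across-track scale, an across-track shift, and an along-track shift) and the function name `rowwise_affine_to_flow` are assumptions made for illustration; the abstract does not specify the exact form used by the authors.

```python
import numpy as np

def rowwise_affine_to_flow(scale, shift_x, shift_y, width):
    """Expand per-row affine parameters into a dense flow of shape (H, W, 2).

    Hypothetical parameterization (not taken from the paper): in row y, a
    pixel at centered column x moves to scale[y] * x + shift_x[y]
    horizontally and is displaced by shift_y[y] vertically.
    """
    scale = np.asarray(scale, dtype=np.float64)
    shift_x = np.asarray(shift_x, dtype=np.float64)
    shift_y = np.asarray(shift_y, dtype=np.float64)
    height = scale.shape[0]
    # Column coordinates measured from the image center
    x = np.arange(width, dtype=np.float64) - (width - 1) / 2.0
    # Horizontal displacement = new position minus old position
    flow_x = (scale[:, None] - 1.0) * x[None, :] + shift_x[:, None]
    # Vertical displacement is constant along each scan line
    flow_y = np.broadcast_to(shift_y[:, None], (height, width))
    return np.stack([flow_x, flow_y], axis=-1)

# Two scan lines, three columns: row 0 is left untouched, row 1 is stretched
flow = rowwise_affine_to_flow([1.0, 2.0], [0.0, 0.5], [0.0, -1.0], width=3)
```

Because every pixel in a row shares the same few parameters, the flow has far fewer degrees of freedom than a free-form per-pixel field, which is one way to read the "physically plausible" constraint described above.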

Paper Structure (45 sections, 30 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Overview of the proposed physically motivated knowledge distillation framework for blind geometric correction of side-scan sonar imagery. The Teacher network is fed with paired distorted and geocoded images $\{I_m, I_f\}$ to explicitly extract geometric difference features and estimate a physically motivated deformation flow. The Student network operates under blind conditions using only distorted input $I_m$, where a hallucination context module approximates the missing geometric difference and is guided by multi-level distillation from the Teacher. Both networks share a physically motivated parametric decoder and employ differentiable forward warping to obtain the corrected image.
  • Figure 2: Structure of the shared encoder used in both Teacher and Student networks. An initial $3\times3$ convolution with Instance Normalization (IN) projects the input image into a 64-channel feature space, followed by five cascaded Sonar-Residual Blocks (ResSBlock). The first stage preserves spatial resolution, while the remaining stages progressively downsample the feature maps, producing a compact and noise-robust representation $\mathbf{F}_{enc}$ for subsequent geometric inference.
  • Figure 3: Structure of the physically motivated parametric decoder. Encoder features $\mathbf{F}_{enc}$ and difference features $\mathbf{F}_{diff}$ are concatenated and progressively upsampled through a hierarchical decoder composed of Residual Decoding Blocks (ResDecBlock) and a Plain Decoding Block (PlainDecBlock), producing a dense feature map $\mathbf{F}_{dec}$. A physically motivated head then aggregates features along each scan line and regresses row-wise affine parameters, which are projected into a dense deformation flow field $\boldsymbol{\Phi}$ consistent with the line-scanning geometry of side-scan sonar imagery.
  • Figure 4: Structure of the Hallucination Context Module (HCM). Given distorted encoder features $\hat{\mathbf{F}}_{enc}^m$, the HCM predicts a hallucinated geometric difference $\hat{\mathbf{F}}_{diff}$ through three stages: local conditioning with a $3\times3$ convolution, global context aggregation using a dilated $3\times3$ convolution, and difference projection through a $1\times1$ convolution. A key component is the Dilated Convolution (right side), which expands the receptive field to capture long-range, non-local distortion dependencies.
  • Figure 5: The Differentiable Forward Warping framework. Given the estimated deformation flow $\boldsymbol{\Phi}$, source pixels in the distorted image $I_m$ are forward-projected to the target grid through soft splatting, producing a forward-projected image and a splatting density map that shows unsampled regions (holes). An iterative hole-filling process based on Gaussian smoothing and normalization is then applied to diffuse valid measurements into holes, reconstructing the final corrected image $I_c$.
  • ...and 3 more figures
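
The forward warping pipeline described in the Figure 5 caption (splat, record density, diffuse into holes) can be sketched in plain NumPy as follows. This is a minimal, non-differentiable illustration under stated assumptions: bilinear splatting stands in for the soft splatting, a 3-tap binomial kernel stands in for Gaussian smoothing, and the function names `forward_splat`, `blur`, and `fill_holes` are hypothetical rather than the authors' implementation.

```python
import numpy as np

def forward_splat(image, flow):
    """Bilinearly splat each source pixel to its target position.

    image: (H, W); flow: (H, W, 2) holding (dx, dy) per source pixel.
    Returns accumulated intensity and splatting density (sum of weights).
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    tx, ty = xs + flow[..., 0], ys + flow[..., 1]
    x0, y0 = np.floor(tx).astype(int), np.floor(ty).astype(int)
    acc, density = np.zeros((h, w)), np.zeros((h, w))
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            # Bilinear weight of this neighboring target pixel
            wgt = (1 - np.abs(tx - xi)) * (1 - np.abs(ty - yi))
            ok = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
            # np.add.at accumulates correctly when many sources hit one target
            np.add.at(acc, (yi[ok], xi[ok]), wgt[ok] * image[ok])
            np.add.at(density, (yi[ok], xi[ok]), wgt[ok])
    return acc, density

def blur(a):
    """Separable 3-tap binomial blur (assumed stand-in for a Gaussian)."""
    p = np.pad(a, 1, mode="constant")
    rows = 0.25 * p[:, :-2] + 0.5 * p[:, 1:-1] + 0.25 * p[:, 2:]
    return 0.25 * rows[:-2] + 0.5 * rows[1:-1] + 0.25 * rows[2:]

def fill_holes(acc, density, iters=10, eps=1e-6):
    """Normalize by density, then diffuse valid values into empty pixels."""
    valid = density > eps
    out = np.where(valid, acc / np.maximum(density, eps), 0.0)
    mask = valid.astype(np.float64)
    for _ in range(iters):
        if mask.min() > 0:
            break  # no holes left
        num, den = blur(out * mask), blur(mask)
        fillable = (mask == 0) & (den > eps)
        # Smoothing followed by normalization: weighted mean of valid neighbors
        out = np.where(fillable, num / np.maximum(den, eps), out)
        mask = np.where(fillable, 1.0, mask)
    return out

# Stretch a constant 4x8 image by 2x across-track: odd columns become holes
img = np.full((4, 8), 3.0)
flow = np.zeros((4, 8, 2))
flow[..., 0] = np.arange(8)[None, :]
acc, density = forward_splat(img, flow)
corrected = fill_holes(acc, density)
```

Forward splatting is used here rather than backward sampling because several source pixels may land on one target while other targets receive none, which matches the non-bijective behavior the caption attributes to SSS deformations; the density map makes both cases explicit.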