SPADE: Sparsity Adaptive Depth Estimator for Zero-Shot, Real-Time, Monocular Depth Estimation in Underwater Environments

Hongjie Zhang, Gideon Billings, Stefan B. Williams

Abstract

Underwater infrastructure requires frequent inspection and maintenance due to harsh marine conditions. Current reliance on human divers or remotely operated vehicles is limited by perceptual and operational challenges, especially around complex structures or in turbid water. Enhancing the spatial awareness of underwater vehicles is key to reducing piloting risks and enabling greater autonomy. To address these challenges, we present SPADE: SParsity Adaptive Depth Estimator, a monocular depth estimation pipeline that combines a pre-trained relative depth estimator with sparse depth priors to produce dense, metric-scale depth maps. Our two-stage approach first scales the relative depth map using the sparse depth points, then refines the resulting metric prediction with our proposed Cascade Conv-Deformable Transformer blocks. Our approach achieves improved accuracy and generalisation over state-of-the-art baselines and runs efficiently at over 15 FPS on embedded hardware, promising to support practical underwater inspection and intervention. This work has been submitted to the IEEE Journal of Oceanic Engineering Special Issue of AUV 2026.
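
The two-stage idea in the abstract can be illustrated with a minimal sketch. It assumes the alignment is a least-squares fit of a global scale and shift in depth space, which is one common choice and not necessarily the fit used in the paper. The function names (`fit_scale_shift`, `align`, `apply_scale_correction`) are hypothetical, and the correction map is hand-written here, whereas in SPADE it would be predicted by the refinement network.

```python
def fit_scale_shift(relative, sparse):
    """Least-squares fit of metric ~= s * relative + t over the sparse points.

    relative: 2D list of relative depth values (assumed to be in depth space).
    sparse:   dict mapping (row, col) -> metric depth in metres.
    """
    xs = [relative[r][c] for (r, c) in sparse]
    ys = list(sparse.values())
    n = len(xs)
    if n == 0:
        raise ValueError("at least one sparse depth point is required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        # Degenerate case (e.g. a single point): fall back to a scale-only fit.
        denom = sum(x * x for x in xs)
        if denom == 0.0:
            raise ValueError("relative depth is zero at every sparse point")
        return sum(x * y for x, y in zip(xs, ys)) / denom, 0.0
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s = cov_xy / var_x
    return s, mean_y - s * mean_x


def align(relative, s, t):
    """Stage 1: apply the global scale and shift to every pixel."""
    return [[s * v + t for v in row] for row in relative]


def apply_scale_correction(aligned, correction):
    """Stage 2: multiply the aligned depth by a per-pixel correction map."""
    return [[a * c for a, c in zip(row_a, row_c)]
            for row_a, row_c in zip(aligned, correction)]


# Toy 2x2 example: the true metric depth is 10 * relative + 1.
relative = [[0.1, 0.2], [0.3, 0.4]]
sparse = {(0, 0): 2.0, (1, 1): 5.0}       # two sparse metric measurements
s, t = fit_scale_shift(relative, sparse)
aligned = align(relative, s, t)

# Placeholder correction map (assumption: stands in for the network output).
correction = [[1.0, 1.0], [1.0, 1.1]]
final = apply_scale_correction(aligned, correction)
```

A global fit like this can only remove a single scale and offset error, which is why a per-pixel correction step is useful when the relative depth is locally inconsistent.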

Paper Structure

This paper contains 22 sections, 14 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Our proposed monocular depth estimation pipeline takes RGB images and sparse depth points as input and predicts dense metric depth maps. It generalises well to underwater data and is robust to varying sparsity of the depth points.
  • Figure 2: The SPADE pipeline consists of two stages. In Stage 1, sparse metric depth points are generated using SLAM or single-shot stereo matching, and a monocular depth estimator predicts a relative depth map from an RGB image. This map is then aligned to the sparse measurements to obtain metric scale. In Stage 2, a scale refinement network built with our Cascade Conv-Deformable Transformer (CCDT) block uses the sparse scale correction map, the aligned depth map, and fused multi-scale features from the relative depth estimator's encoder to predict a per-pixel scale correction map, which is then multiplied with the aligned depth map to produce the final dense depth estimate.
  • Figure 3: The Cascade Conv-Deformable Transformer block has two main components: a ResNet block with the Convolutional Block Attention Module (CBAM) and a transformer encoder with deformable attention from DAT++.
  • Figure 4: In the deformable attention operation, a reference grid $p$ is constructed over the input feature map $x$. All features are projected into queries $q$, and an offset network $\theta_{offset}$ takes the queries and predicts spatial offsets $\Delta p$ that shift the grid points to more informative locations $p'$. A subset of the features $\tilde{x}$ is then sampled at the deformed grid points through bilinear interpolation. The sampled features are projected into key-value pairs, and multi-head attention is computed between the queries and the sampled key-value pairs. This diagram is adapted from the original DAT diagram in DAT++.
  • Figure 5: Example results on the FLSea dataset. The boxes highlight dynamic or fine-featured regions where the ground truth is missing data but the network predicts correctly.
  • ...and 4 more figures
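
The sampling step described in the Figure 4 caption (shift reference grid points by predicted offsets, then read features at the deformed locations with bilinear interpolation) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the offsets are hand-written instead of coming from an offset network, a single-channel feature map is used, and clamping to the map border is an assumption about boundary handling. The query, key and value projections and the attention itself are omitted.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D feature map at a fractional (y, x) pixel location."""
    h, w = len(feature), len(feature[0])
    # Clamp so that deformed points cannot leave the map (assumed behaviour).
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - wx) * feature[y0][x0] + wx * feature[y0][x1]
    bottom = (1 - wx) * feature[y1][x0] + wx * feature[y1][x1]
    return (1 - wy) * top + wy * bottom


def sample_deformed(feature, grid, offsets):
    """Shift each reference point p by its offset and sample at p' = p + dp."""
    return [bilinear_sample(feature, py + dy, px + dx)
            for (py, px), (dy, dx) in zip(grid, offsets)]


# Toy 3x3 single-channel feature map.
feature = [[0.0, 1.0, 2.0],
           [3.0, 4.0, 5.0],
           [6.0, 7.0, 8.0]]
grid = [(0, 0), (1, 1)]                 # reference points p as (row, col)
offsets = [(0.5, 0.5), (0.0, -0.5)]     # hand-written stand-ins for the offsets
sampled = sample_deformed(feature, grid, offsets)
```

Because the interpolation weights vary smoothly with the sampling location, the sampled values are differentiable with respect to the offsets, which is what lets an offset network be trained end to end.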