WaterMono: Teacher-Guided Anomaly Masking and Enhancement Boosting for Robust Underwater Self-Supervised Monocular Depth Estimation

Yilin Ding, Kunqian Li, Han Mei, Shuaixin Liu, Guojia Hou

TL;DR

WaterMono tackles underwater monocular depth estimation under challenging conditions where dynamic regions, image degradation, and diverse camera angles hinder self-supervised learning. It introduces a two-stage teacher-student framework with a Teacher-Guided Anomaly Mask (TGAM), Image Enhancement Boosting (IEB) based on a simplified Underwater Image Formation Model, selective distillation, and rotated distillation to boost rotational robustness. The approach yields state-of-the-art depth accuracy on the FLSea benchmark while also delivering visually enhanced images that maintain inter-frame consistency, and it demonstrates strong generalization to new underwater datasets without fine-tuning. Overall, WaterMono reveals a mutually beneficial coupling between depth estimation and underwater image enhancement, enabling more reliable vision-based navigation for AUVs/ROVs.

Abstract

Depth information serves as a crucial prerequisite for various visual tasks, whether on land or underwater. Recently, self-supervised methods have achieved remarkable performance on several terrestrial benchmarks despite the absence of depth annotations. However, in more challenging underwater scenarios, they encounter new obstacles such as the influence of marine life and the degradation of underwater images, which break the static-scene assumption and yield low-quality images, respectively. In addition, the camera angles of underwater images are more diverse. Fortunately, we have discovered that knowledge distillation presents a promising approach for tackling these challenges. In this paper, we propose WaterMono, a novel framework for depth estimation coupled with image enhancement. It incorporates the following key measures: (1) We present a Teacher-Guided Anomaly Mask to identify dynamic regions within the images; (2) We employ depth information combined with the Underwater Image Formation Model to generate enhanced images, which in turn contribute to the depth estimation task; and (3) We utilize a rotated distillation strategy to enhance the model's rotational robustness. Comprehensive experiments demonstrate the effectiveness of our proposed method for both depth estimation and image enhancement. The source code and pre-trained models are available on the project home page: https://github.com/OUCVisionGroup/WaterMono.
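To make measure (2) concrete, the following is a minimal sketch of how a depth map could drive image enhancement. It assumes a commonly used simplified formation model, I = J * t + B * (1 - t) with transmission t = exp(-beta * d); the exact formulation used by WaterMono may differ. The function name `enhance_with_depth` and the parameters `beta`, `backlight`, and `t_min` are hypothetical, chosen only for illustration.

```python
import numpy as np


def enhance_with_depth(image, depth, beta, backlight, t_min=0.1):
    """Invert the assumed model I = J * t + B * (1 - t), t = exp(-beta * depth).

    image:     (H, W, 3) degraded image with values in [0, 1]
    depth:     (H, W) per-pixel scene depth
    beta:      (3,) per-channel attenuation coefficients (assumed known)
    backlight: (3,) per-channel background light B (assumed known)
    t_min:     lower bound on transmission to avoid dividing by tiny values
    """
    image = np.asarray(image, dtype=np.float64)
    depth = np.asarray(depth, dtype=np.float64)
    beta = np.asarray(beta, dtype=np.float64)
    backlight = np.asarray(backlight, dtype=np.float64)

    # Per-pixel, per-channel transmission, broadcast to (H, W, 3).
    t = np.exp(-depth[..., None] * beta)
    t = np.clip(t, t_min, 1.0)

    # Remove the backscatter term, then undo the attenuation.
    restored = (image - backlight * (1.0 - t)) / t
    return np.clip(restored, 0.0, 1.0)


if __name__ == "__main__":
    # Synthetic round trip: degrade a known scene, then restore it.
    rng = np.random.default_rng(0)
    scene = rng.uniform(0.2, 0.8, size=(4, 5, 3))
    depth = rng.uniform(0.5, 3.0, size=(4, 5))
    beta = np.array([0.6, 0.3, 0.2])
    backlight = np.array([0.1, 0.5, 0.6])

    t = np.exp(-depth[..., None] * beta)
    degraded = scene * t + backlight * (1.0 - t)
    restored = enhance_with_depth(degraded, depth, beta, backlight)
    print("max abs error:", np.abs(restored - scene).max())
```

In this sketch the attenuation coefficients and background light are given as inputs; how they are obtained in practice is outside what the abstract describes.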

Paper Structure

This paper contains 30 sections, 17 equations, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: From left to right: (a) input underwater images, (b) enhanced images, and (c) estimated depth maps obtained by the proposed WaterMono.
  • Figure 2: Examples of dynamic regions in the FLSea dataset. (a)&(b) caustics, (c) fish, (d) diver.
  • Figure 3: Overview of our WaterMono training pipeline. In the first stage, we conduct self-supervised training for a teacher depth network and a teacher pose network. Using the teacher depth network, we generate pseudo depth labels for input images. In the second stage, we freeze the teacher networks and train the student depth and pose networks from scratch. The training approach for the student networks involves a mixture of self-supervised and supervised techniques. The student networks compute the photometric loss on enhanced images and use TGAM to filter out dynamic regions. Pseudo labels, filtered through a 3D consistency check, are also employed to supervise the student depth network. Additionally, paired images and pseudo labels under rotated camera angles are generated through rotation transformations for the student depth network's supervised learning.
  • Figure 4: Examples of $m_{t}$, where black pixels are excluded from the loss. As shown, $m_{t}$ masks dynamic regions such as fish and caustics.
  • Figure 5: Qualitative comparison on the FLSea OUC test set. The first row consists of input images. Results from DCP [he2010single], UDCP [drews2016underwater], ULAP [song2018rapid], UW-Net [gupta2019unsupervised], Monodepth2 [godard2019digging], HR-Depth [lyu2021hr], DIFFNet [zhou2021self], ManyDepth [watson2021temporal], MonoViT [zhao2022monovit], Lite-Mono [zhang2023lite], and our method (WaterMono) are listed from the second to the twelfth row. The ground truth is shown at the bottom, where black areas indicate missing depth information in the depth map.
  • ...and 6 more figures
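
Two ideas from the captions of Figures 3 and 4 can be sketched in a few lines: excluding masked pixels from a per-pixel loss, and building rotated image and pseudo-label pairs for supervised learning. The sketch below is illustrative only. The helper names `masked_mean` and `rotated_pair` are hypothetical, the mask is taken as given (how TGAM derives it is not described in this summary), and rotations are restricted to multiples of 90 degrees as a simplifying assumption; the angles used by the authors may differ.

```python
import numpy as np


def masked_mean(per_pixel_loss, mask):
    """Average a per-pixel loss over pixels where mask == 1.

    Pixels with mask == 0 (shown in black in a mask visualization)
    do not contribute to the result.
    """
    per_pixel_loss = np.asarray(per_pixel_loss, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    kept = mask.sum()
    if kept == 0:
        return 0.0  # nothing left to average; avoid division by zero
    return float((per_pixel_loss * mask).sum() / kept)


def rotated_pair(image, pseudo_depth, quarter_turns):
    """Rotate an image and its pseudo depth label by the same angle.

    quarter_turns counts 90-degree counterclockwise rotations
    (an assumption made here to keep the sketch interpolation-free).
    """
    rotated_image = np.rot90(image, k=quarter_turns, axes=(0, 1))
    rotated_depth = np.rot90(pseudo_depth, k=quarter_turns, axes=(0, 1))
    return rotated_image, rotated_depth


if __name__ == "__main__":
    loss = np.array([[1.0, 2.0], [3.0, 4.0]])
    mask = np.array([[1, 0], [1, 1]])  # the top-right pixel is masked out
    print(masked_mean(loss, mask))

    image = np.zeros((2, 3, 3))
    depth = np.arange(6, dtype=np.float64).reshape(2, 3)
    rot_image, rot_depth = rotated_pair(image, depth, quarter_turns=1)
    print(rot_image.shape, rot_depth.shape)
```

Applying the same transformation to the image and to its label keeps the pair pixel-aligned, which is what lets the rotated pseudo label supervise the prediction made on the rotated image.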