On Robust Cross-View Consistency in Self-Supervised Monocular Depth Estimation

Haimei Zhao, Jing Zhang, Zhuo Chen, Bo Yuan, Dacheng Tao

TL;DR

Two kinds of robust cross-view consistency are studied, which exploit the temporal coherence in both depth feature space and 3D voxel space for SS-MDE, shifting the “point-to-point” alignment paradigm to the “region-to-region” one.

Abstract

Remarkable progress has been made in self-supervised monocular depth estimation (SS-MDE) by exploring cross-view consistency, e.g., photometric consistency and 3D point cloud consistency. However, these consistency constraints are very vulnerable to illumination variance, occlusions, texture-less regions, and moving objects, so they are not robust enough to handle diverse scenes. To address this challenge, we study two kinds of robust cross-view consistency in this paper. First, the spatial offset field between adjacent frames is obtained by reconstructing the reference frame from its neighbors via deformable alignment; this offset field is then used to align the temporal depth features via a Depth Feature Alignment (DFA) loss. Second, the 3D point clouds of each reference frame and its nearby frames are calculated and transformed into voxel space, where the point density in each voxel is calculated and aligned via a Voxel Density Alignment (VDA) loss. In this way, we exploit the temporal coherence in both depth feature space and 3D voxel space for SS-MDE, shifting the "point-to-point" alignment paradigm to the "region-to-region" one. Compared with the photometric consistency loss and the rigid point cloud alignment loss, the proposed DFA and VDA losses are more robust, owing to the strong representation power of deep features and the high tolerance of voxel density to the aforementioned challenges. Experimental results on several outdoor benchmarks show that our method outperforms current state-of-the-art techniques. An extensive ablation study and analysis validate the effectiveness of the proposed losses, especially in challenging scenes. The code and models are available at https://github.com/sunnyHelen/RCVC-depth.
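
To make the "region-to-region" idea behind the VDA loss more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a pinhole camera model, a uniform voxel grid, and a simple sum of absolute density differences as the loss; the function names (`backproject`, `transform`, `voxel_density`, `density_alignment_loss`), the intrinsics, and the voxel size are all hypothetical choices made for illustration, and the paper's exact formulation may differ.

```python
from collections import Counter
from math import floor


def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to 3D camera-frame points (pinhole model, assumed)."""
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d > 0:  # skip invalid (non-positive) depths
                points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points


def transform(points, rotation, translation):
    """Apply a rigid transform p' = R p + t to every point."""
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]


def voxel_density(points, voxel_size):
    """Fraction of all points that falls into each occupied voxel."""
    counts = Counter(tuple(floor(c / voxel_size) for c in p) for p in points)
    total = sum(counts.values())
    return {voxel: n / total for voxel, n in counts.items()} if total else {}


def density_alignment_loss(density_a, density_b):
    """Sum of absolute density differences over the union of occupied voxels (assumed loss form)."""
    voxels = set(density_a) | set(density_b)
    return sum(abs(density_a.get(v, 0.0) - density_b.get(v, 0.0)) for v in voxels)


# Toy example: the source depth differs slightly from the reference at one pixel.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # hypothetical relative pose (no motion)
depth_ref = [[2.0, 2.0], [2.0, 2.0]]
depth_src = [[2.0, 2.0], [2.0, 2.1]]

pts_ref = backproject(depth_ref, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
pts_src = transform(backproject(depth_src, 1.0, 1.0, 0.5, 0.5), identity, [0.0, 0.0, 0.0])

loss = density_alignment_loss(voxel_density(pts_ref, 0.5), voxel_density(pts_src, 0.5))
print(loss)  # 0.0: the perturbed point still falls into the same voxel
```

In this toy case the small depth perturbation does not move the point out of its voxel, so the density-based loss stays at zero, whereas a per-point distance would already be non-zero. This is only meant to illustrate the kind of tolerance the abstract attributes to voxel density.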

Paper Structure (65 sections, 54 equations, 21 figures, 16 tables)


Figures (21)

  • Figure 1: Visualization of the photometric loss. Black areas represent regions without photometric loss, whereas warmer colors indicate larger loss values in other regions. The first row is the reference image, and the second and third rows are images warped from adjacent images using ground-truth depth and pose.
  • Figure 2: Comparison of the prior "point-to-point" alignment paradigm with our "region-to-region" one. We propose the "region-to-region" alignment paradigm by enforcing photometric consistency at the feature level (a) and replacing point cloud alignment with voxel density alignment in 3D space (b).
  • Figure 3: An illustration of our learning framework, which consists of DepthNet, PoseNet, and OffsetNet for depth estimation, pose estimation, and alignment offset learning, respectively. OffsetNet learns a feature alignment offset field using a self-supervised loss calculated by reconstructing the reference view from adjacent views with deformable convolutions. The learned offset field is then used to align the temporal depth features learned by DepthNet. The three branches in the framework are jointly optimized during training, while only DepthNet is used during inference.
  • Figure 4: Illustration of the guidance from the correspondence in RGB images to the correspondence in depth.
  • Figure 5: Illustration of the key process of OffsetNet, which aims to learn feature alignment offsets from RGB frames. The learned offsets are then used to align depth features.
  • ...and 16 more figures
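
The captions of Figures 3 and 5 describe an offset field that is learned from RGB frames and then reused to align depth features. The sketch below illustrates that reuse step under strong simplifying assumptions: it replaces the deformable convolutions mentioned in the caption with plain per-pixel bilinear resampling, works on a single-channel feature map, and uses a mean absolute difference as the alignment loss. All names (`bilinear_sample`, `warp_with_offsets`, `feature_alignment_loss`) are hypothetical and are not taken from the authors' code.

```python
from math import floor


def bilinear_sample(feature, x, y):
    """Bilinearly sample a 2D feature map (list of rows) at (x, y), clamping to the border."""
    h, w = len(feature), len(feature[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(floor(x)), int(floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
    bottom = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp_with_offsets(feature, offsets):
    """Resample `feature` at (x + dx, y + dy), where offsets[y][x] = (dx, dy)."""
    return [
        [
            bilinear_sample(feature, x + offsets[y][x][0], y + offsets[y][x][1])
            for x in range(len(feature[0]))
        ]
        for y in range(len(feature))
    ]


def feature_alignment_loss(ref_feature, src_feature, offsets):
    """Mean absolute difference between reference and offset-aligned source features."""
    warped = warp_with_offsets(src_feature, offsets)
    n = len(ref_feature) * len(ref_feature[0])
    return sum(
        abs(r - w)
        for ref_row, warped_row in zip(ref_feature, warped)
        for r, w in zip(ref_row, warped_row)
    ) / n


# Toy depth features: the source is a horizontally flipped copy of the reference.
ref = [[1.0, 2.0], [3.0, 4.0]]
src = [[2.0, 1.0], [4.0, 3.0]]

no_offsets = [[(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]]
good_offsets = [[(1.0, 0.0), (-1.0, 0.0)], [(1.0, 0.0), (-1.0, 0.0)]]

print(feature_alignment_loss(ref, src, no_offsets))    # 1.0: features are misaligned
print(feature_alignment_loss(ref, src, good_offsets))  # 0.0: offsets undo the flip
```

In the framework described by the captions, the offsets would come from OffsetNet rather than being written by hand, and the features would be multi-channel outputs of DepthNet; the toy offsets here only show how a correct offset field brings the two feature maps into agreement.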