Adaptive Discrete Disparity Volume for Self-supervised Monocular Depth Estimation

Jianwei Ren

TL;DR

This work tackles self-supervised monocular depth estimation by replacing rigid, handcrafted depth discretization. It introduces Adaptive Discrete Disparity Volume (ADDV), a differentiable module that learns an image-specific set of bins and a per-pixel probability volume over them, so that depth is estimated by a soft aggregation of adaptive bin centers. To stabilize training without ground-truth supervision, the authors add two regularizations: uniformizing, applied through a bin-balancing loss term, and sharpening, applied through a temperature parameter. Experiments on KITTI show that ADDV consistently outperforms the handcrafted discretization baselines (UD and SID) at the same bin counts, highlighting the value of per-image adaptivity and the proposed regularizations for robust depth learning in diverse scenes.
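As a rough illustration of the soft aggregation step, the sketch below turns one pixel's scores over the bins into probabilities and takes the probability-weighted sum of the bin centers. The function names, the logits, and the bin centers are all hypothetical values chosen for the example. In ADDV both the centers and the probabilities would come from the network, which is not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's bin scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_aggregate(logits, bin_centers):
    """Expected disparity: probability-weighted sum of bin centers."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, bin_centers))

# Hypothetical per-image bin centers (assumed values, not from the paper).
bin_centers = [0.1, 0.2, 0.4, 0.8]

# Equal scores give the plain average of the centers.
flat_estimate = soft_aggregate([0.0, 0.0, 0.0, 0.0], bin_centers)

# One dominant score pulls the estimate onto that bin's center.
peaked_estimate = soft_aggregate([0.0, 50.0, 0.0, 0.0], bin_centers)
```

Because every step is a smooth function of the scores and the centers, gradients can flow to both, which is what allows the bins themselves to be learned.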

Abstract

In self-supervised monocular depth estimation tasks, discrete disparity prediction has been shown to attain higher quality depth maps than common continuous methods. However, current discretization strategies often divide depth ranges of scenes into bins in a handcrafted and rigid manner, limiting model performance. In this paper, we propose a learnable module, Adaptive Discrete Disparity Volume (ADDV), which is capable of dynamically sensing depth distributions in different RGB images and generating adaptive bins for them. Without any extra supervision, this module can be integrated into existing CNN architectures, allowing networks to produce representative values for bins and a probability volume over them. Furthermore, we introduce two novel training strategies, uniformizing and sharpening, implemented through a loss term and a temperature parameter, respectively, to provide regularization under self-supervised conditions and prevent model degradation or collapse. Empirical results demonstrate that ADDV effectively processes global information, generating appropriate bins for various scenes and producing higher quality depth maps compared to handcrafted methods.
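The effect of a sharpening temperature can be sketched with a small numeric example. The code assumes the common convention of dividing the logits by a temperature before the softmax; the paper's exact formulation is not given in this section. It compares the soft-argmax estimate with the center of the most likely bin. All names and values are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; a lower temperature gives a sharper result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_argmax(logits, bin_centers, temperature):
    """Probability-weighted sum of bin centers at the given temperature."""
    probs = softmax_with_temperature(logits, temperature)
    return sum(p * c for p, c in zip(probs, bin_centers))

def most_likely_center(logits, bin_centers):
    """Center of the bin with the highest score (hard argmax)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return bin_centers[best]

# Hypothetical scores and bin centers for a single pixel.
logits = [1.0, 2.0, 0.5, 0.0]
bin_centers = [0.1, 0.2, 0.4, 0.8]

# Gap between the soft estimate and the most likely bin, per temperature.
target = most_likely_center(logits, bin_centers)
gaps = [abs(soft_argmax(logits, bin_centers, t) - target) for t in (1.0, 0.5, 0.1)]
```

In this example the gap shrinks as the temperature drops, because the probability mass concentrates on a single bin and the weighted sum approaches that bin's center.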

Paper Structure

This paper contains 13 sections, 11 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Overview. a) The framework consists of an encoder-decoder depth estimation network and a separate pose estimation network. b) Detailed decoder of the depth net. ADDV modules are inserted to achieve adaptive depth discretization.
  • Figure 2: Detailed ADDV. The upper component predicts probability distributions of pixels and aggregates them into a volume, while the lower is designed to generate adaptive bins.
  • Figure 3: Benefit of sharpening. Encouraging the distribution to exhibit extremes reduces the bias between soft-argmax and MLE.
  • Figure 4: Qualitative results. All three discretization strategies adopt 32 bins and are evaluated on the validation dataset. Failure cases are highlighted.
  • Figure 5: Analysis of adaptive bins. a) Curves of representative values generated by UD (green), SID (red) and ADDV (blue and orange) for two scenes: Fig. 4b and Fig. 4c. b) Histogram of disparity for the same two scenes.
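
For context on the two handcrafted baselines compared in Figures 4 and 5, the sketch below generates bin edges under uniform discretization (UD) and spacing-increasing discretization (SID). SID is written in its commonly used log-space form. The range, the bin count, and the function names are assumptions for illustration. Whether the paper applies these schemes to depth or to disparity, and how it derives representative values from the edges, is not stated in this section.

```python
import math

def ud_edges(lo, hi, num_bins):
    """Uniform discretization: equally spaced edges on [lo, hi]."""
    step = (hi - lo) / num_bins
    return [lo + step * i for i in range(num_bins + 1)]

def sid_edges(lo, hi, num_bins):
    """Spacing-increasing discretization: edges equally spaced in log space.

    Requires 0 < lo < hi. Bin widths grow geometrically from lo to hi.
    """
    log_lo = math.log(lo)
    log_ratio = math.log(hi / lo)
    return [math.exp(log_lo + log_ratio * i / num_bins) for i in range(num_bins + 1)]

# Hypothetical range and bin count, chosen so the SID edges are easy to read.
ud = ud_edges(0.1, 100.0, 3)    # approximately [0.1, 33.4, 66.7, 100.0]
sid = sid_edges(0.1, 100.0, 3)  # approximately [0.1, 1.0, 10.0, 100.0]
```

Both schemes fix the edges in advance for every image. ADDV instead predicts the bins per image.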