ProDepth: Boosting Self-Supervised Multi-Frame Monocular Depth with Probabilistic Fusion

Sungmin Woo, Wonjoon Lee, Woo Jin Kim, Dogyoon Lee, Sangyoun Lee

TL;DR

ProDepth addresses the core challenge of dynamic objects breaking the static-scene assumption in self-supervised multi-frame monocular depth estimation by introducing an auxiliary depth decoder to infer per-pixel uncertainty, a Probabilistic Cost Volume Modulation (PCVM) to directly rectify corrupted cost distributions through probabilistic fusion of single-frame and multi-frame cues, and a loss reweighting strategy to curb incorrect self-supervision in dynamic regions. The method learns depth distributions per pixel and fuses them via a per-pixel uncertainty map, enabling robust depth predictions in dynamic scenes without additional semantic supervision. Empirical results on Cityscapes and KITTI show state-of-the-art performance, with strong generalization on Waymo Open, demonstrating the effectiveness of probabilistic fusion and uncertainty-aware learning for dynamic-object handling in self-supervised depth estimation.
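The fusion idea can be made concrete for a single pixel. The following is a minimal sketch, not the authors' implementation: it assumes the fused distribution is a convex combination of the two distributions weighted by the uncertainty, and the names `softmax`, `gaussian_over_bins`, `fuse`, the depth candidates, and the matching scores are all hypothetical values chosen for illustration. The paper's actual PCVM formulation may differ.

```python
import math

def softmax(scores):
    """Turn matching scores over depth candidates into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gaussian_over_bins(depth_bins, mean, std):
    """Discretise a Gaussian single-frame depth estimate over the depth candidates."""
    weights = [math.exp(-0.5 * ((d - mean) / std) ** 2) for d in depth_bins]
    total = sum(weights)
    return [w / total for w in weights]

def fuse(p_multi, p_single, uncertainty):
    """Assumed fusion rule: convex combination weighted by uncertainty in [0, 1].

    High uncertainty (likely dynamic) favours the single-frame distribution;
    low uncertainty (likely static) favours the multi-frame distribution.
    """
    return [(1.0 - uncertainty) * pm + uncertainty * ps
            for pm, ps in zip(p_multi, p_single)]

def argmax_depth(depth_bins, probs):
    """Return the depth candidate with the highest probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return depth_bins[best]

# Hypothetical depth candidates in metres.
depth_bins = [2.0, 4.0, 6.0, 8.0, 10.0]

# Hypothetical multi-frame matching scores whose peak sits at the wrong depth,
# as can happen when the object moved between frames.
p_multi = softmax([0.1, 0.2, 0.3, 0.5, 3.0])

# Hypothetical single-frame estimate: Gaussian centred at 4 m.
p_single = gaussian_over_bins(depth_bins, mean=4.0, std=1.0)

fused_dynamic = fuse(p_multi, p_single, uncertainty=0.9)
fused_static = fuse(p_multi, p_single, uncertainty=0.1)

print(argmax_depth(depth_bins, fused_dynamic))  # follows the single-frame cue
print(argmax_depth(depth_bins, fused_static))   # follows the multi-frame cue
```

With a high uncertainty the fused peak moves to the single-frame estimate, while a low uncertainty leaves the multi-frame peak in place, which mirrors the behaviour described for dynamic and static pixels.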

Abstract

Self-supervised multi-frame monocular depth estimation relies on the geometric consistency between successive frames under the assumption of a static scene. However, the presence of moving objects in dynamic scenes introduces inevitable inconsistencies, causing misaligned multi-frame feature matching and misleading self-supervision during training. In this paper, we propose a novel framework called ProDepth, which effectively addresses the mismatch problem caused by dynamic objects using a probabilistic approach. We first deduce the uncertainty associated with the static-scene assumption by adopting an auxiliary decoder. This decoder analyzes inconsistencies embedded in the cost volume, inferring the probability of areas being dynamic. We then directly rectify the erroneous cost volume for dynamic areas through a Probabilistic Cost Volume Modulation (PCVM) module. Specifically, we derive probability distributions of depth candidates from both single-frame and multi-frame cues, modulating the cost volume by adaptively fusing those distributions based on the inferred uncertainty. Additionally, we present a self-supervision loss reweighting strategy that not only masks out incorrect supervision with high uncertainty but also mitigates the risks in the remaining possibly dynamic areas in accordance with the probability. Our proposed method outperforms state-of-the-art approaches in all metrics on both the Cityscapes and KITTI datasets, and demonstrates superior generalization ability on the Waymo Open dataset.
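The loss reweighting described in the abstract combines a hard mask with a soft weight. The following is a minimal sketch under stated assumptions, not the paper's loss: the threshold value, the `1 - uncertainty` weight, the normalisation by the weight sum, and the names `reweight` and `reweighted_loss` are all hypothetical choices made for illustration.

```python
def reweight(uncertainty, threshold=0.5):
    """Assumed per-pixel weight for the self-supervision loss.

    Pixels above the (hypothetical) threshold are masked out entirely;
    the remaining pixels are down-weighted by their probability of being dynamic.
    """
    if uncertainty > threshold:
        return 0.0
    return 1.0 - uncertainty

def reweighted_loss(per_pixel_loss, uncertainty, threshold=0.5):
    """Weighted mean of per-pixel losses; normalising by the weight sum is an assumption."""
    weights = [reweight(u, threshold) for u in uncertainty]
    weighted_sum = sum(w * l for w, l in zip(weights, per_pixel_loss))
    norm = sum(weights)
    return weighted_sum / norm if norm > 0 else 0.0

# Hypothetical photometric losses and uncertainties for three pixels:
# a static pixel, a possibly dynamic pixel, and a clearly dynamic pixel.
losses = [0.2, 0.2, 0.9]
uncertainties = [0.0, 0.25, 0.8]

weights = [reweight(u) for u in uncertainties]
loss = reweighted_loss(losses, uncertainties)
print(weights)
print(loss)
```

In this toy case the large loss on the clearly dynamic pixel contributes nothing, so the misleading supervision it carries does not reach the gradient, while the possibly dynamic pixel still contributes with reduced weight.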

Paper Structure (25 sections, 14 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: Our ProDepth performs uncertainty-aware adaptive fusion of the probability distributions from both single-frame and multi-frame cues. The fused distribution follows the distribution of single-frame cues for a dynamic pixel, while adhering to the distribution of multi-frame cues for a static pixel. Error maps in the second column depict large depth errors in red and small in blue.
  • Figure 2: Overview of the proposed ProDepth. We construct the multi-frame cost volume with $I_{s}$ and $I_t$, and estimate single-frame depth as a Gaussian distribution using the target image $I_t$. In an auxiliary branch, uncertainty is inferred by comparing $D_{\text{single}}$ and $D_\text{cv}$, where the latter is estimated from cost volume features. To rectify the erroneous cost volume, a PCVM module adaptively fuses probabilities derived from single- and multi-frame cues. Furthermore, we incorporate a loss reweighting strategy in $\mathcal{L}_{up,s}$ and $\mathcal{L}^{log}_{up,s}$ to mitigate errors caused by moving objects at the training level. Note that the probability distribution of a dynamic pixel is illustrated as an example.
  • Figure 3: The identification of dynamic objects. In contrast to the binary consistency mask generated in ManyDepth, our uncertainty map reasons about the probability of moving objects with structural awareness.
  • Figure 4: Qualitative results on Cityscapes. Red and yellow boxes indicate moving and static objects. Error maps depict large depth errors in red and small in blue.
  • Figure 5: ProDepth with and without the PCVM module. Depth probability distributions of a dynamic pixel (marked in yellow) are presented. Our PCVM modulates the incorrect distribution in the cost volume, rectifying the errors in dynamic areas.
  • ...and 4 more figures