D$^3$epth: Self-Supervised Depth Estimation with Dynamic Mask in Dynamic Scenes

Siyu Chen, Hong Liu, Wenhao Li, Ying Zhu, Guoquan Wang, Jianbing Wu

TL;DR

D3epth addresses dynamic objects in self-supervised depth estimation from two perspectives: a dynamic mask that reduces their impact at the loss level, and a cost volume auto-masking strategy for multi-frame depth estimation. It also introduces a spectral entropy uncertainty module that uses spectral entropy to guide uncertainty estimation during depth fusion, addressing issues that arise from cost volume computation in dynamic environments.

Abstract

Depth estimation is a crucial technology in robotics. Recently, self-supervised depth estimation methods have demonstrated great potential, as they can efficiently leverage large amounts of unlabelled real-world data. However, most existing methods are designed under the assumption of static scenes, which hinders their adaptability in dynamic environments. To address this issue, we present D$^3$epth, a novel method for self-supervised depth estimation in dynamic scenes. It tackles the challenge of dynamic objects from two key perspectives. First, within the self-supervised framework, we design a reprojection constraint to identify regions likely to contain dynamic objects, allowing the construction of a dynamic mask that mitigates their impact at the loss level. Second, for multi-frame depth estimation, we introduce a cost volume auto-masking strategy that leverages adjacent frames to identify regions associated with dynamic objects and generate corresponding masks, which then guide subsequent processing. Furthermore, we propose a spectral entropy uncertainty module that incorporates spectral entropy to guide uncertainty estimation during depth fusion, effectively addressing issues arising from cost volume computation in dynamic environments. Extensive experiments on the KITTI and Cityscapes datasets demonstrate that the proposed method consistently outperforms existing self-supervised monocular depth estimation baselines. Code is available at https://github.com/Csyunling/D3epth.
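
For readers unfamiliar with the term, the following is a minimal sketch of spectral entropy in its textbook form: the Shannon entropy of a normalized power spectrum. The abstract does not say which signal the module computes it on or how the result enters the uncertainty estimate, so the 2D input, the function name `spectral_entropy`, and the normalization to [0, 1] are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def spectral_entropy(x, eps=1e-12):
    """Normalized Shannon entropy of the 2D power spectrum of `x`.

    Hypothetical helper: returns a value near 0 when the energy sits in a
    single frequency bin and near 1 when it is spread evenly over all bins.
    """
    power = np.abs(np.fft.fft2(x)) ** 2      # power spectrum
    total = power.sum()
    if total <= eps:                         # all-zero input: no spectrum
        return 0.0
    p = power / total                        # treat spectrum as a distribution
    entropy = -(p * np.log(p + eps)).sum()   # Shannon entropy in nats
    return float(entropy / np.log(p.size))   # divide by the maximum, log(N)


# A constant map has all energy in the DC bin (low entropy); a single
# impulse has a flat spectrum (maximal entropy).
flat = np.ones((8, 8))
impulse = np.zeros((8, 8))
impulse[0, 0] = 1.0
flat_entropy = spectral_entropy(flat)
impulse_entropy = spectral_entropy(impulse)
```

Under this definition, a low value indicates a concentrated spectrum and a high value a spread-out one; how D3epth maps such a quantity to depth-fusion uncertainty is not specified in this summary.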

Paper Structure

This paper contains 15 sections, 18 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overall framework of the proposed D3epth. We propose a Dynamic Mask (DM) within the self-supervised framework by masking regions that are likely to be dynamic objects, identified where both reprojection losses exhibit high values. Additionally, we tackle the issue of dynamic objects from the perspective of depth estimation in DepthNet, focusing primarily on multi-frame depth estimation.
  • Figure 2: Comparison of our D3epth with Monodepth2. The upper part of the figure illustrates the typical occlusion scenario addressed by Monodepth2, which both our method and Monodepth2 can handle. The lower part depicts a case that Monodepth2 fails to resolve because it relies on the minimum of two reprojection losses. In contrast, our method addresses this case by additionally computing a Dynamic Mask (DM) to correct the loss.
  • Figure 3: Overall framework of DepthNet in our D3epth. It consists of three main modules: single-frame depth estimation (MonoDepth), multi-frame depth estimation (MultiDepth), and the Spectral Entropy Uncertainty (SEU) module. Cost Volume Auto-Masking is applied before computing the cost volume to filter out regions affected by dynamic objects and to guide subsequent processing. The SEU module leverages the guidance from Cost Volume Auto-Masking and incorporates spectral entropy to enrich the information. This combined approach enhances uncertainty estimation and improves the fusion of MonoDepth and MultiDepth.
  • Figure 4: Qualitative results of D3epth and the baseline on the Cityscapes dataset. The first row displays the input images, the second row shows the baseline results, and the third row presents the outputs of our D3epth.
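
The loss-level masking described in the captions of Figures 1 and 2 can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the captions say only that regions are masked where both reprojection losses are high, so the fixed threshold `tau`, the function names, and the masked-mean reduction are hypothetical choices.

```python
import numpy as np


def min_reprojection_loss(loss_prev, loss_next):
    """Per-pixel minimum of two reprojection losses (the Monodepth2-style rule)."""
    return np.minimum(loss_prev, loss_next)


def dynamic_mask(loss_prev, loss_next, tau):
    """Keep-mask: 0 where BOTH losses exceed `tau`, 1 elsewhere.

    `tau` is an assumed threshold; the source does not say how "high" is decided.
    """
    both_high = (loss_prev > tau) & (loss_next > tau)
    return (~both_high).astype(np.float32)


def masked_loss(loss_prev, loss_next, tau):
    """Mean of the per-pixel minimum loss over pixels the mask keeps."""
    per_pixel = min_reprojection_loss(loss_prev, loss_next)
    mask = dynamic_mask(loss_prev, loss_next, tau)
    return float((per_pixel * mask).sum() / max(mask.sum(), 1.0))


# Toy 2x2 losses against the previous and next source frames.
loss_prev = np.array([[0.1, 0.9], [0.8, 0.2]], dtype=np.float32)
loss_next = np.array([[0.2, 0.1], [0.7, 0.3]], dtype=np.float32)

mask = dynamic_mask(loss_prev, loss_next, tau=0.5)
loss_value = masked_loss(loss_prev, loss_next, tau=0.5)
```

In the toy input, the top-right pixel is high in only one frame, which is the occlusion case that the per-pixel minimum already handles, so the mask keeps it. The bottom-left pixel is high in both frames, so the minimum alone would still pass a large loss through; the mask removes that pixel from the average.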