Digging into contrastive learning for robust depth estimation with diffusion models

Jiyuan Wang, Chunyu Lin, Lang Nie, Kang Liao, Shuwei Shao, Yao Zhao

TL;DR

This paper proposes a novel robust depth estimation method called D4RD, featuring a custom contrastive learning mode tailored for diffusion models to mitigate performance degradation in complex environments, and builds the 'trinity' contrastive scheme.

Abstract

Recently, diffusion-based depth estimation methods have drawn widespread attention due to their elegant denoising patterns and promising performance. However, they are typically unreliable under adverse conditions prevalent in real-world scenarios, such as rain and snow. In this paper, we propose a novel robust depth estimation method called D4RD, featuring a custom contrastive learning mode tailored for diffusion models to mitigate performance degradation in complex environments. Concretely, we integrate the strength of knowledge distillation into contrastive learning, building the 'trinity' contrastive scheme. This scheme uses the sampled noise of the forward diffusion process as a natural reference, guiding the predicted noise in diverse scenes toward a more stable and precise optimum. Moreover, we extend the noise-level trinity to the more generic feature and image levels, establishing a multi-level contrast that distributes the burden of robust perception across the overall network. Before addressing complex scenarios, we enhance the stability of the baseline diffusion model with three straightforward yet effective improvements, which facilitate convergence and remove depth outliers. Extensive experiments demonstrate that D4RD surpasses existing state-of-the-art solutions on synthetic corruption datasets and real-world weather conditions. Source code and data are available at https://github.com/wangjiyuan9/D4RD.
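
The noise-level idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the plain MSE terms, and the `distill_weight` parameter are all assumptions made for illustration, and the network outputs are replaced by hand-made stand-ins. The sketch only shows the three roles involved: the noise sampled in the forward diffusion process acts as the reference, a teacher prediction on the clear scene acts as the distillation target, and the student prediction on the adverse scene is pulled toward both.

```python
import math
import random


def forward_diffuse(x0, noise, alpha_bar):
    """Standard DDPM forward step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * noise."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * x + b * n for x, n in zip(x0, noise)]


def mse(u, v):
    """Mean squared error between two equal-length lists."""
    return sum((p - q) ** 2 for p, q in zip(u, v)) / len(u)


def trinity_noise_loss(sampled_noise, teacher_pred, student_pred, distill_weight=1.0):
    """Hypothetical noise-level loss (an assumption, not the paper's formula).

    Pulls the student's predicted noise toward the sampled noise (reference)
    and toward the teacher's predicted noise (distillation term).
    """
    to_reference = mse(student_pred, sampled_noise)
    to_teacher = mse(student_pred, teacher_pred)
    return to_reference + distill_weight * to_teacher


random.seed(0)
depth = [random.random() for _ in range(8)]          # toy flattened depth map
noise = [random.gauss(0.0, 1.0) for _ in range(8)]   # noise sampled in the forward process
noisy_depth = forward_diffuse(depth, noise, alpha_bar=0.5)

# Stand-ins for network outputs: the teacher sees the clear scene and is
# close to the reference; the student sees the adverse scene and drifts more.
teacher_pred = [n + 0.05 * random.gauss(0.0, 1.0) for n in noise]
student_pred = [n + 0.30 * random.gauss(0.0, 1.0) for n in noise]

loss = trinity_noise_loss(noise, teacher_pred, student_pred)
print(f"toy trinity loss: {loss:.4f}")
```

In this toy setup the loss is zero only when the student matches both the sampled noise and the teacher, which is the intuition behind using the sampled noise as a shared reference across scenes.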

Paper Structure (20 sections, 19 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Comparisons with RMDE methods [wjyiccv, ecdp] on KITTI and WeatherKITTI. The length of each line represents the magnitude of performance degradation under adverse environments. All methods are trained on the same dataset, WeatherKITTI, for fairness.
  • Figure 2: The training framework of D4RD is depicted in the upper part of the image, where the teacher model $\mathcal{F}_{T}$ is enclosed in the green dashed box and the student network $\mathcal{F}_{S}$ comprises the remaining parts. The whole network has only three components: the base feature network (denoted $\mathcal{F}_{fT}$ in $\mathcal{F}_T$ and $\mathcal{F}_{fS}$ in $\mathcal{F}_S$), the diffusion process (grey dashed box), and the robust CNN $\mathcal{F}_R$. Below that, multi-level trinity learning (i.e., images, depth features, and noise prediction) is presented in (a), (b), and (c), respectively.
  • Figure 3: Visual comparisons among three types of robust learning methods. Our trinity contrast method is closer to the expected effect and achieves better actual performance.
  • Figure 4: Visual results of the trinity at each level. Compared to the original image, as shown in the red dashed rectangle, the image-level trinity helps handle water surface artifacts and ground snow. The feature-level trinity builds a coarse depth map with some wrong edges, and the noise-level trinity fixes them all. Best viewed zoomed in.
  • Figure 5: Qualitative results for the WeatherKITTI dataset. We compare D4RD with the current SoTA RMDE methods in the adverse rain and snow subsets. The part marked with 'Clear' is the corresponding sunny image (processed for more clarity). Regions with prominent differences are highlighted using dashed boxes.
  • ...and 3 more figures