Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation

Yifan Mao, Jian Liu, Xianming Liu

TL;DR

This paper introduces Stealing Stable Diffusion (SSD) prior, an approach to robust monocular depth estimation. It compensates for the lack of diverse training data by using Stable Diffusion to generate synthetic images that mimic challenging conditions such as low light and rain.

Abstract

Monocular depth estimation is a crucial task in computer vision. While existing methods show impressive results under standard conditions, they often struggle to perform reliably in low-light or rainy scenes because diverse training data is lacking. This paper introduces Stealing Stable Diffusion (SSD) prior, a novel approach to robust monocular depth estimation. The approach addresses this limitation by using Stable Diffusion to generate synthetic images that mimic challenging conditions. A self-training mechanism further improves the model's depth estimation in such environments. To make fuller use of the Stable Diffusion prior, the DINOv2 encoder is integrated into the depth model architecture, enabling the model to leverage rich semantic priors and improve its scene understanding. In addition, a teacher loss guides the student models to acquire meaningful knowledge independently, reducing their dependency on the teacher models. The approach is evaluated on nuScenes and Oxford RobotCar, two challenging public datasets, and the results demonstrate its effectiveness. Source code and weights are available at: https://github.com/hitcslj/SSD.

Paper Structure (20 sections, 7 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Shortcomings of GAN. Compared to GAN, which suffers from issues like noise, fake rainy effects, and blurriness, our GDT can generate more diverse and realistic images.
  • Figure 2: Darkness and weather effects on sensors. The images above are from the nuScenes dataset [caesar2020nuscenes]. In night-time photos, RGB images often exhibit noise, textureless regions, and blurriness, which are not conducive to self-supervised learning. Additionally, rainy weather can introduce blur and reflections, leading to sparse and unreliable LiDAR signals, which are not suitable for supervised learning.
  • Figure 3: The GDT pipeline incorporates multiple large models and PatchFusion to generate high-quality training samples.
  • Figure 4: SSD framework for robust depth estimation. The Student Net receives guidance from the Teacher Net, leveraging a stable diffusion prior. The semantic loss ensures semantic consistency, while the teacher loss enables the Student Net to learn beyond the capabilities of the Teacher Net.
  • Figure 5: Comparison of samples from the nuScenes dataset [caesar2020nuscenes] among monodepth2 [monodepth2], md4all-DD [md4all], our self-supervised teacher model SSD-T, and the student model SSD-S.
  • ...and 1 more figures