DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation

Yiqun Duan, Xianda Guo, Zheng Zhu

TL;DR

DiffusionDepth reframes monocular depth estimation as a diffusion-denoising process operating in a latent depth space guided by monocular visual cues. It introduces a self-diffusion training scheme to cope with sparse ground-truth depth and a Monocular Conditioned Denoising Block to fuse visual information. Trained end-to-end with pixel- and latent-space losses, it achieves state-of-the-art results on KITTI and NYU-Depth-V2 while preserving practical inference speed. The work provides a principled integration of diffusion models into dense 3D perception and offers insights for extending diffusion-based reasoning to other 3D tasks.

Abstract

Monocular depth estimation is a challenging task that predicts the pixel-wise depth from a single 2D image. Current methods typically model this problem as a regression or classification task. We propose DiffusionDepth, a new approach that reformulates monocular depth estimation as a denoising diffusion process. It learns an iterative denoising process that "denoises" a random depth distribution into a depth map under the guidance of monocular visual conditions. The process is performed in a latent space defined by a dedicated depth encoder and decoder. Instead of diffusing ground truth (GT) depth, the model learns to reverse the process of diffusing its own refined depth into a random depth distribution. This self-diffusion formulation overcomes the difficulty of applying generative models to sparse GT depth scenarios. The proposed approach benefits this task by refining the depth estimate step by step, which helps generate accurate and highly detailed depth maps. Experimental results on the KITTI and NYU-Depth-V2 datasets suggest that a simple yet efficient diffusion approach can reach state-of-the-art performance in both indoor and outdoor scenarios with acceptable inference time.
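
The self-diffusion idea in the abstract can be illustrated with the standard closed-form forward (noising) step of a denoising diffusion model, applied to the model's own refined depth latent rather than to sparse GT depth. The following is a minimal sketch, not the authors' implementation: the linear beta schedule, the function names, the latent shape, and the random array standing in for a refined depth latent are all assumptions made for illustration.

```python
import numpy as np

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def diffuse(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form and return the noise that was used."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
alpha_bar = alpha_bar_schedule(1000)

# Hypothetical stand-in for the model's own refined depth latent (dense, unlike sparse GT).
refined_latent = rng.standard_normal((16, 8, 8))

# Noised latent at an intermediate step; a denoiser would be trained to undo this.
x_t, noise = diffuse(refined_latent, 500, alpha_bar, rng)
```

Because the diffused target is a dense latent produced by the model, every position carries a training signal, which is the motivation the abstract gives for avoiding direct diffusion of sparse GT depth.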

Paper Structure (29 sections, 16 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Illustration of DiffusionDepth. The model refines the depth map $x_t$ with monocular guidance $c$, from the random depth initialization $x_T$ to the refined estimation result $x_0$.
  • Figure 2: Overview of DiffusionDepth. Given monocular visual input, the model employs a feature extractor and multiscale feature aggregation to construct visual guidance conditions. The Monocular Conditioned Denoising Block (MCDB) iteratively refines the depth distribution from noise initialization to refined depth prediction under the guidance of monocular visual conditions.
  • Figure 3: Illustration of the Monocular Conditioned Denoising Block. The visual condition is fused with the depth latent hierarchically.
  • Figure 4: Qualitative comparison of proposed DiffusionDepth on the KITTI outdoor driving scenarios against two representative methods, BinsFormer (classification-regression based) and VA-Depth (Variational Refine). We highlight the details with white boxes. The visualization is from the best online results for a fair comparison.
  • Figure 5: Qualitative depth results on the NYU-Depth-V2 dataset.
  • ...and 6 more figures
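
The refinement described in the captions of Figures 1 and 2, from a random initialization $x_T$ to a refined result $x_0$ under guidance $c$, can be sketched as a conditioned reverse-diffusion loop. This is a sketch under stated assumptions, not the paper's method: the deterministic DDIM-style update, the linear schedule, and all names are illustrative, and `toy_denoiser` is a hypothetical oracle that replaces the learned Monocular Conditioned Denoising Block by treating the condition itself as the clean latent.

```python
import numpy as np

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def toy_denoiser(x_t, cond, a_t):
    """Hypothetical stand-in for a learned conditioned denoiser.

    It returns the noise that would explain x_t if the clean latent were `cond`.
    A real model would predict this from x_t, the step index, and visual features.
    """
    return (x_t - np.sqrt(a_t) * cond) / np.sqrt(1.0 - a_t)

def refine_depth(cond, num_steps=50, seed=0):
    """Iteratively refine x_T (pure noise) into x_0 with a DDIM-style update."""
    rng = np.random.default_rng(seed)
    alpha_bar = alpha_bar_schedule(num_steps)
    x = rng.standard_normal(cond.shape)  # x_T: random depth initialization
    for t in range(num_steps - 1, -1, -1):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[t - 1] if t > 0 else 1.0
        eps = toy_denoiser(x, cond, a_t)                       # predicted noise
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)  # current clean estimate
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps
    return x  # x_0: refined estimate

cond = np.full((4, 4), 0.5)   # hypothetical guidance, here also the toy target
x_0 = refine_depth(cond)
```

With the oracle denoiser the loop converges to `cond` up to floating-point error, which only checks the update rule; the quality of a real result depends entirely on the learned denoiser.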