Monocular Depth Estimation using Diffusion Models

Saurabh Saxena, Abhishek Kar, Mohammad Norouzi, David J. Fleet

TL;DR

This work introduces DepthGen, a diffusion-model-based framework for monocular depth estimation that handles noisy and incomplete training data via depth infilling, an $L_1$ loss, and step-unrolled denoising. It leverages self-supervised pretraining (Palette-style) followed by supervised fine-tuning, achieving state-of-the-art results on NYU and strong performance on KITTI. DepthGen inherently represents multimodal depth distributions, enabling depth ambiguity resolution and zero-shot depth completion, which in turn supports text-to-3D and novel-view synthesis pipelines when integrated with image diffusion models. The approach demonstrates the practical impact of diffusion models for depth tasks and opens avenues for multimodal 3D scene generation from text.

Abstract

We formulate monocular depth estimation using denoising diffusion models, inspired by their recent successes in high-fidelity image generation. To that end, we introduce innovations to address problems arising from noisy, incomplete depth maps in training data, including step-unrolled denoising diffusion, an $L_1$ loss, and depth infilling during training. To cope with the limited availability of data for supervised training, we leverage pre-training on self-supervised image-to-image translation tasks. Despite the simplicity of the approach, with a generic loss and architecture, our DepthGen model achieves SOTA performance on the indoor NYU dataset and near-SOTA results on the outdoor KITTI dataset. Further, with a multimodal posterior, DepthGen naturally represents depth ambiguity (e.g., from transparent surfaces), and its zero-shot performance, combined with depth imputation, enables a simple but effective text-to-3D pipeline. Project page: https://depth-gen.github.io
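The three training ingredients named in the abstract (depth infilling, an $L_1$ loss, and step-unrolled denoising) can be illustrated with a small NumPy sketch. This is one plausible reading of the description, not the authors' implementation: the function names `infill_nearest` and `l1_noise_loss`, the scalar noise level `gamma`, and the `denoiser(rgb, y_t, gamma)` callable are all hypothetical, and the brute-force nearest-neighbor search is only practical for small arrays.

```python
import numpy as np

def infill_nearest(depth, valid):
    """Fill missing pixels with the depth of the nearest valid pixel (brute force)."""
    vy, vx = np.nonzero(valid)       # coordinates of pixels with measured depth
    my, mx = np.nonzero(~valid)      # coordinates of pixels with missing depth
    filled = depth.copy()
    if len(vy) == 0 or len(my) == 0:
        return filled
    # Squared distance from every missing pixel to every valid pixel.
    d2 = (my[:, None] - vy[None, :]) ** 2 + (mx[:, None] - vx[None, :]) ** 2
    nearest = d2.argmin(axis=1)
    filled[my, mx] = depth[vy[nearest], vx[nearest]]
    return filled

def l1_noise_loss(denoiser, rgb, depth, valid, gamma, rng, unroll=False):
    """L1 loss on predicted noise; `gamma` in (0, 1) is an assumed noise level."""
    y0 = infill_nearest(depth, valid)
    eps = rng.standard_normal(y0.shape)
    y_t = np.sqrt(gamma) * y0 + np.sqrt(1.0 - gamma) * eps
    if unroll:
        # Assumed form of step unrolling: estimate the clean depth from y_t,
        # then rebuild the noisy input from that estimate instead of from the
        # ground truth. In a real framework no gradient would flow through this pass.
        eps_hat = denoiser(rgb, y_t, gamma)
        y_pred = (y_t - np.sqrt(1.0 - gamma) * eps_hat) / np.sqrt(gamma)
        y_t = np.sqrt(gamma) * y_pred + np.sqrt(1.0 - gamma) * eps
        # Noise that would map the infilled ground truth to the rebuilt y_t.
        eps = (y_t - np.sqrt(gamma) * y0) / np.sqrt(1.0 - gamma)
    eps_hat = denoiser(rgb, y_t, gamma)
    return np.abs(eps - eps_hat).mean()
```

With `unroll=True` the network sees inputs built from its own predictions, which is closer to what it encounters at sampling time. Whether the loss should be restricted to pixels with measured depth is left open here; the sketch averages over all pixels.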

Paper Structure

This paper contains 18 sections, 1 equation, 10 figures, 11 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Training Architecture. Given a groundtruth depth map, we first infill missing depth using nearest neighbor interpolation. Then, following standard diffusion training, we add noise to the depth map and train a neural network to predict the noise given the RGB image and noisy depth map. During finetuning, we unroll one step of the forward pass and replace the groundtruth depth map with the prediction.
  • Figure 2: Examples of multimodal predictions on the NYU Depth V2 val dataset. Rows 1-2 contain glass doors/windows where the model learns to predict the depth for either the glass surface or the surface behind it. Row 3 has a dark area next to the refrigerator for which the depth is unclear from RGB alone. In row 4 the model hallucinates the reflected door as a bath cabinet, which seems plausible from the RGB image.
  • Figure 3: Multimodal depth predictions on the KITTI val dataset.
  • Figure 4: Text to 3D samples. Given a text prompt, an image is first generated using Imagen (first row of first column), after which depth is estimated (second row of first column). Subsequently the camera is moved to reveal new parts of the scene, which are then infilled using an image completion model and DepthGen (which conditions on both the incomplete depth map and the filled image). At each step, newly generated RGBD points are added to a global point cloud, which is visualized in the rightmost column. See Table \ref{tab:text_to_3d_samples_extras} for more samples.
  • Figure 5: Pipeline for iteratively generating a 3D scene conditioned on the text prompt $c$ = "A bedroom." See text for details.
  • ...and 5 more figures
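
The Figure 4 caption describes adding newly generated RGBD points to a global point cloud at each camera step. A minimal sketch of that accumulation is given below, assuming a standard pinhole camera model. The intrinsics `fx, fy, cx, cy`, the 4x4 `cam_to_world` pose, and the function names `unproject` and `add_view` are illustrative assumptions; the captions do not specify the camera model or how poses are chosen.

```python
import numpy as np

def unproject(depth, rgb, fx, fy, cx, cy, cam_to_world=None):
    """Lift an RGBD image to colored 3D points in world coordinates."""
    if cam_to_world is None:
        cam_to_world = np.eye(4)
    h, w = depth.shape
    # u indexes columns and v indexes rows; both grids have shape (h, w).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1).reshape(-1, 4)
    pts_world = (pts_cam @ cam_to_world.T)[:, :3]
    return pts_world, rgb.reshape(-1, 3)

def add_view(cloud_xyz, cloud_rgb, depth, rgb, mask, intrinsics, cam_to_world):
    """Append only the pixels flagged in `mask` (newly revealed regions)."""
    xyz, colors = unproject(depth, rgb, *intrinsics, cam_to_world=cam_to_world)
    keep = mask.reshape(-1)
    return (np.concatenate([cloud_xyz, xyz[keep]], axis=0),
            np.concatenate([cloud_rgb, colors[keep]], axis=0))
```

Here `mask` would mark the pixels that were empty after moving the camera and were then filled by the image completion model and the depth model, so that regions already in the cloud are not duplicated.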