Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model

Saurabh Saxena, Junhwa Hur, Charles Herrmann, Deqing Sun, David J. Fleet

TL;DR

This work introduces DMD, a diffusion-based framework for zero-shot metric depth estimation that jointly models indoor and outdoor scenes without task-specific architectural biases. It represents depth in log-scale, augments and conditions on field-of-view (FOV), and trains a v-parameterized denoiser on a diverse data mixture. Relative to the prior state of the art, it reduces relative error (REL) by 25% on zero-shot indoor datasets and 33% on zero-shot outdoor datasets while using only a small number of denoising steps. Ablations show that log-depth, FOV augmentation and conditioning, and the diffusion parameterization each contribute to robust cross-domain depth estimation. The approach offers a practical route to metric depth across varied environments and camera intrinsics, with potential for further improvement via intrinsic prediction and larger training corpora.
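The "v-parameterized denoiser" mentioned above refers to a standard way of choosing the regression target of a diffusion model. The sketch below illustrates the general v-parameterization identity for a variance-preserving noise process, not DMD's actual training code. The cosine schedule, the array shapes, and the names `v_target` and `x_from_v` are assumptions made for illustration.

```python
import numpy as np

def v_target(x, eps, alpha, sigma):
    """Regression target under v-parameterization: v = alpha * eps - sigma * x."""
    return alpha * eps - sigma * x

def x_from_v(z, v, alpha, sigma):
    """Recover the clean signal from noisy z and v (requires alpha^2 + sigma^2 = 1)."""
    return alpha * z - sigma * v

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(4, 4))   # stand-in for a normalized depth map
eps = rng.standard_normal(size=(4, 4))    # Gaussian noise

t = 0.3                                   # hypothetical noise level in [0, 1]
alpha = np.cos(0.5 * np.pi * t)           # assumed cosine schedule
sigma = np.sin(0.5 * np.pi * t)

z = alpha * x + sigma * eps               # noisy sample seen by the denoiser
v = v_target(x, eps, alpha, sigma)        # what the network would be trained to predict
x_rec = x_from_v(z, v, alpha, sigma)      # exact inversion given the true v
```

With a perfect prediction of `v`, the clean depth map is recovered exactly, which is why a model trained on this target can be used to estimate the clean signal at any noise level.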

Abstract

While methods for monocular depth estimation have made significant strides on standard benchmarks, zero-shot metric depth estimation remains unsolved. Challenges include the joint modeling of indoor and outdoor scenes, which often exhibit significantly different distributions of RGB and depth, and the depth-scale ambiguity due to unknown camera intrinsics. Recent work has proposed specialized multi-head architectures for jointly modeling indoor and outdoor scenes. In contrast, we advocate a generic, task-agnostic diffusion model, with several advancements such as log-scale depth parameterization to enable joint modeling of indoor and outdoor scenes, conditioning on the field-of-view (FOV) to handle scale ambiguity, and synthetically augmenting FOV during training to generalize beyond the limited camera intrinsics in training datasets. Furthermore, by employing a more diverse training mixture than is common, and an efficient diffusion parameterization, our method, DMD (Diffusion for Metric Depth), achieves a 25% reduction in relative error (REL) on zero-shot indoor and 33% reduction on zero-shot outdoor datasets over the current SOTA using only a small number of denoising steps. For an overview see https://diffusion-vision.github.io/dmd
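The two conditioning ideas in the abstract, log-scale depth and field-of-view, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the depth bounds `D_MIN` and `D_MAX`, the mapping to [-1, 1], the use of the vertical FOV, and all function names are hypothetical choices made here.

```python
import numpy as np

# Assumed depth range in meters; the abstract does not state these values.
D_MIN, D_MAX = 0.5, 80.0

def depth_to_log_target(depth_m):
    """Map metric depth to [-1, 1] on a log scale."""
    d = np.clip(depth_m, D_MIN, D_MAX)
    unit = np.log(d / D_MIN) / np.log(D_MAX / D_MIN)   # in [0, 1]
    return 2.0 * unit - 1.0

def log_target_to_depth(target):
    """Invert depth_to_log_target back to meters."""
    unit = (np.clip(target, -1.0, 1.0) + 1.0) / 2.0
    return D_MIN * (D_MAX / D_MIN) ** unit

def vertical_fov_deg(image_height_px, focal_length_px):
    """Vertical field of view of a pinhole camera, in degrees."""
    return np.degrees(2.0 * np.arctan(image_height_px / (2.0 * focal_length_px)))

depths = np.array([0.5, 2.0, 10.0, 80.0])
targets = depth_to_log_target(depths)
recovered = log_target_to_depth(targets)
fov = vertical_fov_deg(480, 240.0)   # hypothetical 480-pixel-high image
```

In a setup like this, the FOV value would be passed to the model as an extra conditioning input, and cropping or rescaling training images would change the effective FOV that is supplied alongside them.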

Paper Structure

This paper contains 14 sections, 3 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Relative depth error for DMD (ours) compared to ZoeDepth (SOTA) on eight zero-shot and two in-distribution ($\ast$) benchmarks. DMD outperforms ZoeDepth by a substantial margin on all benchmarks.
  • Figure 2: Qualitative comparison between our method and ZoeDepth [bhat2023zoedepth] on indoor scenes. Unlike ZoeDepth, our method estimates depth at a more accurate scale across diverse datasets.
  • Figure 3: Qualitative comparison between DMD and ZoeDepth [bhat2023zoedepth] on outdoor scenes. Compared with ZoeDepth, our method estimates a more accurate depth scale.
  • Figure 4: Linearly scaling depth leads to noisy predictions for images with shallow depth. Predicting depth in log-scale fixes this. See the section on joint indoor-outdoor modeling for more details. Note that here we use a max depth of 5 meters for better visualization.
  • Figure 5: Qualitative comparison between DMD-NK (fine-tuned on NYU and KITTI) and DMD-MIX (fine-tuned on KITTI, NYU, nuScenes, and Taskonomy). DMD-MIX further improves depth scale estimation as well as fine details on depth boundaries.
  • ...and 3 more figures
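
One way to read the Figure 4 caption is that a fixed amount of noise in the normalized prediction turns into very different metric errors depending on how depth is scaled. The sketch below works through that arithmetic. It is an illustrative interpretation, not an analysis from the paper: the depth range, the noise magnitude, and the function names are assumptions.

```python
import math

# Assumed depth range in meters (illustrative values only).
D_MIN, D_MAX = 0.5, 80.0

def linear_rel_error(depth_m, noise):
    """Relative depth error caused by `noise` in a linear [-1, 1] encoding.

    Linear map: depth = D_MIN + (t + 1) / 2 * (D_MAX - D_MIN),
    so the absolute error is the same at every depth.
    """
    abs_err = noise * (D_MAX - D_MIN) / 2.0
    return abs_err / depth_m

def log_rel_error(noise):
    """Relative depth error caused by `noise` in a log [-1, 1] encoding.

    Log map: depth = D_MIN * (D_MAX / D_MIN) ** ((t + 1) / 2),
    so the error is a constant multiplicative factor at every depth.
    """
    return math.exp(noise * math.log(D_MAX / D_MIN) / 2.0) - 1.0

noise = 0.01  # hypothetical perturbation of the normalized prediction
for depth in (1.0, 5.0, 50.0):
    print(f"depth {depth:5.1f} m: "
          f"linear {linear_rel_error(depth, noise):.4f}, "
          f"log {log_rel_error(noise):.4f}")
```

Under these assumptions, the same perturbation costs tens of percent of relative error at 1 m with linear scaling but only a few percent with log scaling, while at large depths linear scaling is the more precise of the two. That trade-off is consistent with the caption's observation about shallow scenes.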