BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

Xiang Zhang, Bingxin Ke, Hayko Riemenschneider, Nando Metzger, Anton Obukhov, Markus Gross, Konrad Schindler, Christopher Schroers

TL;DR

This work proposes BetterDepth, a conditional diffusion-based refiner that takes the prediction of a pre-trained monocular depth estimation (MDE) model as depth conditioning and iteratively refines details based on the input image. It can improve the performance of other MDE models in a plug-and-play manner without further re-training.

Abstract

By training over large-scale datasets, zero-shot monocular depth estimation (MDE) methods show robust performance in the wild but often suffer from insufficient detail. Although recent diffusion-based MDE approaches exhibit a superior ability to extract details, they struggle in geometrically complex scenes that challenge their geometry prior, trained on less diverse 3D data. To leverage the complementary merits of both worlds, we propose BetterDepth to achieve geometrically correct affine-invariant MDE while capturing fine details. Specifically, BetterDepth is a conditional diffusion-based refiner that takes the prediction from pre-trained MDE models as depth conditioning, in which the global depth layout is well-captured, and iteratively refines details based on the input image. For the training of such a refiner, we propose global pre-alignment and local patch masking methods to ensure BetterDepth remains faithful to the depth conditioning while learning to add fine-grained scene details. With efficient training on small-scale synthetic datasets, BetterDepth achieves state-of-the-art zero-shot MDE performance on diverse public datasets and on in-the-wild scenes. Moreover, BetterDepth can improve the performance of other MDE models in a plug-and-play manner without further re-training.

Paper Structure

This paper contains 25 sections, 6 equations, 25 figures, 10 tables, and 1 algorithm.

Figures (25)

  • Figure 1: Monocular depth estimation (depth map and 3D reconstruction with color-coded normals). Feed-forward methods, like Depth Anything [yang2024depthanything], produce robust global 3D shape but suffer from over-smoothed details. Diffusion-based methods, like Marigold [ke2023marigold], extract fine details but fall short in zero-shot global shape recovery. Our proposed BetterDepth offers the best of both worlds and achieves robust zero-shot depth estimation with fine details.
  • Figure 2: BetterDepth training pipeline. Given training images $\mathbf{x}$ and labels $\mathbf{d}$, we first estimate coarse depth maps $\tilde{\mathbf{d}}$ with the pre-trained $\mathbf{M}_{\mathrm{FFD}}$ and apply global pre-alignment to $\tilde{\mathbf{d}}$ using $\mathbf{d}$ as reference. Afterwards, the frozen latent encoder is employed to convert the image $\mathbf{x}$, the depth labels $\mathbf{d}$, and the aligned depth conditioning $\tilde{\mathbf{d}}'$ to the latent space. To construct the masked training objective, $\tilde{\mathbf{d}}'$ and $\mathbf{d}$ are split into non-overlapping patches $\{\tilde{\mathbf{d}}'_n\}$ and $\{\mathbf{d}_n\}$, and dissimilar patches are filtered out by thresholding, producing the patch-level similarity mask. Finally, the mask is downscaled to the latent space resolution for diffusion training.
  • Figure 3: Illustration of output distributions after applying pre-alignment and patch masking. The output distribution of BetterDepth ($\hat{\mathcal{X}}$) is pushed towards the intersection of $\mathcal{X}({\mathbf{M}_{\mathrm{FFD}}, \{\mathbf{D}_{\mathrm{syn}}, \mathbf{D}_{\mathrm{real}}\}})$ and $\mathcal{X}({\mathbf{M}_{\mathrm{DM}}, \mathbf{D}_{\mathrm{syn}}})$ to achieve detailed zero-shot MDE.
  • Figure 4: BetterDepth inference pipeline. Given an image $\mathbf{x}$ and a pre-trained depth model, we first estimate the coarse depth map $\tilde{\mathbf{d}}$ as conditioning. After converting $\mathbf{x}$ and $\tilde{\mathbf{d}}$ to latent space, we concatenate the latent codes $\mathbf{z}^{\mathbf{x}}$, $\mathbf{z}^{\tilde{\mathbf{d}}}$ with the depth latent $\mathbf{z}^{\hat{\mathbf{d}}}_t$ for denoising. After $T$-step refinement, random Gaussian noise $\mathbf{z}^{\hat{\mathbf{d}}}_T$ has been converted to $\mathbf{z}^{\hat{\mathbf{d}}}_0$ and is decoded to the final estimate $\hat{\mathbf{d}}$.
  • Figure 5: Qualitative comparisons of depth estimation and 3D reconstruction results (colored as normals), where Marigold predicts depth values and the others output disparity.
  • ...and 20 more figures
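
The two training-time steps named in the Figure 2 caption, global pre-alignment and patch-level similarity masking, can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that pre-alignment is a least-squares fit of a scale and shift (a common choice for affine-invariant depth), that patch dissimilarity is the mean absolute difference per patch, and that the threshold `eta` and patch size `patch` are free parameters. The function names `global_pre_align` and `patch_similarity_mask` are hypothetical.

```python
import numpy as np


def global_pre_align(coarse, label):
    """Fit scale s and shift t by least squares so that s * coarse + t ~ label.

    Assumption: affine (scale and shift) alignment; the source only says
    "global pre-alignment" with the label as reference.
    """
    # Design matrix with one column for the scale and one for the shift.
    A = np.stack([coarse.ravel(), np.ones(coarse.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, label.ravel(), rcond=None)
    return s * coarse + t


def patch_similarity_mask(aligned, label, patch=8, eta=0.1):
    """Return a boolean mask of shape (H / patch, W / patch); True keeps a patch.

    Assumption: a patch is "dissimilar" when its mean absolute difference
    to the label exceeds eta.
    """
    h, w = label.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    diff = np.abs(aligned - label)
    # Split into non-overlapping patches and average the error inside each.
    per_patch = diff.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    return per_patch <= eta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    label = rng.random((32, 32))
    coarse = 0.5 * label + 2.0          # coarse depth that is off by an affine map
    aligned = global_pre_align(coarse, label)
    mask = patch_similarity_mask(aligned, label, patch=8, eta=0.1)
    print(mask.shape, mask.all())
```

If the patch size is chosen equal to the downsampling factor of the latent encoder, the patch-level mask already has the latent resolution; otherwise it would need to be resized, as the caption notes. The mask would then restrict the diffusion training loss to the kept patches.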