BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

Xiang Zhang; Bingxin Ke; Hayko Riemenschneider; Nando Metzger; Anton Obukhov; Markus Gross; Konrad Schindler; Christopher Schroers

TL;DR

This work proposes BetterDepth, a conditional diffusion-based refiner that takes the prediction of a pre-trained MDE model as depth conditioning and iteratively refines details based on the input image. It can improve the performance of other MDE models in a plug-and-play manner without further re-training.

Abstract

By training over large-scale datasets, zero-shot monocular depth estimation (MDE) methods show robust performance in the wild but often suffer from insufficient detail. Although recent diffusion-based MDE approaches exhibit a superior ability to extract details, they struggle in geometrically complex scenes that challenge their geometry prior, trained on less diverse 3D data. To leverage the complementary merits of both worlds, we propose BetterDepth to achieve geometrically correct affine-invariant MDE while capturing fine details. Specifically, BetterDepth is a conditional diffusion-based refiner that takes the prediction from pre-trained MDE models as depth conditioning, in which the global depth layout is well-captured, and iteratively refines details based on the input image. For the training of such a refiner, we propose global pre-alignment and local patch masking methods to ensure BetterDepth remains faithful to the depth conditioning while learning to add fine-grained scene details. With efficient training on small-scale synthetic datasets, BetterDepth achieves state-of-the-art zero-shot MDE performance on diverse public datasets and on in-the-wild scenes. Moreover, BetterDepth can improve the performance of other MDE models in a plug-and-play manner without further re-training.

Paper Structure (25 sections, 6 equations, 25 figures, 10 tables, 1 algorithm)

Introduction
Related Work
Method
Problem Formulation
BetterDepth Framework
Training Strategies
Inference Strategies
Experiments and Analysis
Experimental Settings
Benchmarking
Ablation Study
Method Analysis
Conclusion
Training Procedure
Comparison with Depth Anything V2
...and 10 more sections

Figures (25)

Figure 1: Monocular depth estimation (depth map and 3D reconstruction with color-coded normals). Feed-forward methods, like Depth Anything [yang2024depthanything], produce robust global 3D shape but suffer from over-smoothed details. Diffusion-based methods, like Marigold [ke2023marigold], extract fine details but fall short in zero-shot global shape recovery. Our proposed BetterDepth offers the best of both worlds and achieves robust zero-shot depth estimation with fine details.
Figure 2: BetterDepth training pipeline. Given training images $\mathbf{x}$ and labels $\mathbf{d}$, we first estimate coarse depth maps $\tilde{\mathbf{d}}$ with the pre-trained $\mathbf{M}_{\mathrm{FFD}}$ and apply global pre-alignment to $\tilde{\mathbf{d}}$ using $\mathbf{d}$ as reference. Afterwards, the frozen latent encoder is employed to convert the image $\mathbf{x}$, the depth labels $\mathbf{d}$, and the aligned depth conditioning $\tilde{\mathbf{d}}'$ to the latent space. To construct the masked training objective, $\tilde{\mathbf{d}}'$ and $\mathbf{d}$ are split into non-overlapping patches $\{\tilde{\mathbf{d}}'_n\}$ and $\{\mathbf{d}_n\}$, and dissimilar patches are filtered out by thresholding, producing the patch-level similarity mask. Finally, the mask is downscaled to the latent space resolution for diffusion training.
Figure 3: Illustration of output distributions after applying pre-alignment and patch masking. The output distribution of BetterDepth ($\hat{\mathcal{X}}$) is pushed towards the intersection of $\mathcal{X}({\mathbf{M}_{\mathrm{FFD}}, \{\mathbf{D}_{\mathrm{syn}}, \mathbf{D}_{\mathrm{real}}\}})$ and $\mathcal{X}({\mathbf{M}_{\mathrm{DM}}, \mathbf{D}_{\mathrm{syn}}})$ to achieve detailed zero-shot MDE.
Figure 4: BetterDepth inference pipeline. Given an image $\mathbf{x}$ and a pre-trained depth model, we first estimate the coarse depth map $\tilde{\mathbf{d}}$ as conditioning. After converting $\mathbf{x}$ and $\tilde{\mathbf{d}}$ to latent space, we concatenate the latent codes $\mathbf{z}^{\mathbf{x}}$, $\mathbf{z}^{\tilde{\mathbf{d}}}$ with the depth latent $\mathbf{z}^{\hat{\mathbf{d}}}_t$ for denoising. After $T$-step refinement, random Gaussian noise $\mathbf{z}^{\hat{\mathbf{d}}}_T$ has been converted to $\mathbf{z}^{\hat{\mathbf{d}}}_0$ and is decoded to the final estimate $\hat{\mathbf{d}}$.
Figure 5: Qualitative comparisons of depth estimation and 3D reconstruction results (colored as normals), where Marigold predicts depth values and the others output disparity.
...and 20 more figures
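
The two training-time steps named in the Figure 2 caption, global pre-alignment and patch-level similarity masking, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The least-squares scale-and-shift fit, the mean absolute difference used as the patch distance, and the names global_pre_align, patch_similarity_mask, patch and threshold are all assumptions made for illustration; the captions only state that the coarse depth is aligned to the label and that dissimilar patches are removed by thresholding.

```python
import numpy as np


def global_pre_align(coarse, label):
    """Align a coarse depth map to the label with one global scale and shift.

    Assumption: the alignment is a least-squares fit of s * coarse + t to the
    label. The source only says "global pre-alignment using d as reference".
    """
    # Design matrix with one column for the scale and one for the shift.
    A = np.stack([coarse.ravel(), np.ones(coarse.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, label.ravel(), rcond=None)
    return s * coarse + t


def patch_similarity_mask(aligned, label, patch=8, threshold=0.1):
    """Return a boolean mask with one entry per non-overlapping patch.

    True marks patches where the aligned conditioning is close to the label.
    The distance (mean absolute difference) and both default values are
    hypothetical choices, not taken from the paper.
    """
    h, w = label.shape
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    err = np.abs(aligned - label)
    # Group pixels into (rows, patch, cols, patch) blocks and average each block.
    per_patch = err.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    return per_patch < threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    label = rng.random((16, 16))
    coarse = 0.5 * label - 0.2          # affine-distorted copy of the label
    aligned = global_pre_align(coarse, label)
    mask = patch_similarity_mask(aligned, label, patch=8, threshold=0.1)
    print(mask.shape, bool(mask.all()))
```

In this sketch the mask already has one entry per patch, so if the patch size equals the downscaling factor of the latent encoder, no further resizing is needed; for any other patch size the mask would still have to be resampled to the latent resolution, as the caption describes.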
