Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation

Bingxin Ke; Anton Obukhov; Shengyu Huang; Nando Metzger; Rodrigo Caye Daudt; Konrad Schindler

TL;DR

Marigold uses a latent-diffusion image generator (Stable Diffusion) as a rich visual prior for monocular depth estimation. By fine-tuning only the denoiser in the latent space and training on synthetic RGB-D data, it achieves affine-invariant depth with strong zero-shot generalization, delivering state-of-the-art results across diverse real datasets. The method demonstrates that foundation-model priors can generalize 3D scene understanding to unseen domains while remaining computationally practical. The authors also introduce an annealed multi-resolution noise scheme and test-time ensembling to boost robustness and performance.
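To make "affine-invariant depth" and "test-time ensembling" concrete: an affine-invariant prediction is only defined up to an unknown scale and shift, so it is typically aligned to a reference by least squares before it is compared or combined. The NumPy sketch below is illustrative only. The function names are hypothetical, and the ensembling step (align every prediction to the first one, then take the pixel-wise median) is a deliberately simplified stand-in, not necessarily the authors' exact procedure.

```python
import numpy as np

def align_scale_shift(pred, target):
    """Return s * pred + t, with (s, t) minimizing ||s * pred + t - target||^2."""
    p = pred.ravel()
    # Design matrix [pred, 1] so that the unknowns are scale and shift.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, target.ravel(), rcond=None)
    return s * pred + t

def ensemble_median(preds):
    """Simplified ensembling (an assumption of this sketch): align each
    prediction to the first one, then take the pixel-wise median."""
    ref = preds[0]
    aligned = [ref] + [align_scale_shift(p, ref) for p in preds[1:]]
    return np.median(np.stack(aligned, axis=0), axis=0)
```

In an evaluation setting, `target` would be the ground-truth depth restricted to valid pixels. In the ensembling setting, the predictions would come from several inference runs on the same image with different starting noise.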

Abstract

Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io.

Paper Structure (30 sections, 4 equations, 21 figures, 6 tables)

Introduction
Related Work
Monocular Depth
Diffusion Models
Diffusion for Monocular Depth Estimation
Foundation Models
Method
Generative Formulation
Network Architecture
Fine-Tuning Protocol
Inference
Experiments
Implementation
Evaluation
Ablation Studies
...and 15 more sections

Figures (21)

Figure 1: We present Marigold, a diffusion model and associated fine-tuning protocol for monocular depth estimation. Its core principle is to leverage the rich visual knowledge stored in modern generative image models. Our model, derived from Stable Diffusion and fine-tuned with synthetic data, can zero-shot transfer to unseen datasets, offering state-of-the-art monocular depth estimation results.
Figure 2: Overview of the Marigold fine-tuning protocol. Starting from pretrained Stable Diffusion, we encode the image $\mathbf{x}$ and depth $\mathbf{d}$ into the latent space using the original Stable Diffusion VAE. We fine-tune just the U-Net by optimizing the standard diffusion objective relative to the depth latent code. Image conditioning is achieved by concatenating the two latent codes before feeding them into the U-Net. The first layer of the U-Net is modified to accept concatenated latent codes. See details in the Network Architecture and Fine-Tuning Protocol sections.
Figure 3: Overview of the Marigold inference scheme. Given an input image $\mathbf{x}$, we encode it with the original Stable Diffusion VAE into the latent code $\mathbf{z}^{(\mathbf{x})}$, and concatenate with the depth latent $\mathbf{z}^{(\mathbf{d})}_t$ before giving it to the modified fine-tuned U-Net on every denoising iteration. After executing the schedule of $T$ steps, the resulting depth latent $\mathbf{z}^{(\mathbf{d})}_0$ is decoded into an image, whose 3 channels are averaged to get the final estimation $\hat{\mathbf{d}}$. See the Inference section for details.
Figure 4: Qualitative comparison (depth) of monocular depth estimation methods across different datasets. Marigold excels at capturing thin structures (e.g., chair legs) and preserving overall layout of the scene (e.g., walls in ETH3D example and chairs in DIODE example).
Figure 5: Qualitative comparison (unprojected, colored as normals) of monocular depth estimation methods across different datasets. Marigold stands out for its superior reconstruction of flat surfaces and detailed structures.
...and 16 more figures
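The data flow described in the captions of Figures 2 and 3 can be sketched in a few lines of NumPy. This is a schematic under stated assumptions, not the authors' implementation. `encode`, `denoise_step`, and `decode` are hypothetical placeholder callables standing in for the VAE encoder, one U-Net denoising update, and the VAE decoder. Initializing the widened first layer by duplicating the pretrained weights and halving them is an assumption not stated in the captions; it is chosen here because it keeps the layer's response unchanged when both halves of the input carry the same latent.

```python
import numpy as np

def expand_input_conv(weight):
    """Widen a conv weight (out_ch, in_ch, kh, kw) to accept 2 * in_ch channels.

    Assumption of this sketch: duplicate the pretrained weights along the
    input-channel axis and halve them, so activation magnitudes stay similar.
    """
    return np.concatenate([weight, weight], axis=1) / 2.0

def infer_depth(image, encode, denoise_step, decode, num_steps, rng):
    """Schematic inference loop; all three callables are hypothetical."""
    z_x = encode(image)                      # image latent, shape (C, H, W)
    z_d = rng.standard_normal(z_x.shape)     # depth latent starts as noise
    for t in reversed(range(num_steps)):
        # Condition on the image by channel-wise concatenation at every step.
        unet_input = np.concatenate([z_x, z_d], axis=0)   # (2C, H, W)
        z_d = denoise_step(unet_input, z_d, t)
    decoded = decode(z_d)                    # 3-channel image, shape (3, H, W)
    return decoded.mean(axis=0)              # average channels -> (H, W)
```

With real components, `denoise_step` would wrap the fine-tuned U-Net together with a noise scheduler update, and `num_steps` would correspond to the schedule length $T$ in Figure 3.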
