MetricGold: Leveraging Text-To-Image Latent Diffusion Models for Metric Depth Estimation

Ansh Shah, K Madhava Krishna

TL;DR

This work tackles monocular metric depth estimation by exploiting the rich priors of pretrained latent diffusion models. MetricGold reframes depth prediction as conditional denoising in a latent diffusion framework built on a Stable Diffusion backbone, incorporating a Depth VAE and a conditioned Denoising U-Net, and trains on photo-realistic synthetic data to achieve strong zero-shot generalization. Key contributions include (i) latent-space depth modeling with image-conditioned latents, (ii) log-depth representation to unify indoor and outdoor scales, and (iii) an efficient training protocol that completes on a single RTX 3090 in about two days. The approach yields sharper, more accurate metric depth estimates across diverse datasets, highlighting the practicality of diffusion priors for scalable, cross-domain depth estimation without real-world depth supervision.

Abstract

Recovering metric depth from a single image remains a fundamental challenge in computer vision, requiring both scene understanding and accurate scaling. While deep learning has advanced monocular depth estimation, current models often struggle with unfamiliar scenes and layouts, particularly in zero-shot scenarios and when predicting scale-ergodic metric depth. We present MetricGold, a novel approach that harnesses the rich priors of generative diffusion models to improve metric depth estimation. Building upon recent advances in MariGold, DDVM, and Depth Anything V2, our method combines latent diffusion, log-scaled metric depth representation, and synthetic data training, respectively. MetricGold trains efficiently on a single RTX 3090 within two days using photo-realistic synthetic data from HyperSIM, VirtualKitti, and TartanAir. Our experiments demonstrate robust generalization across diverse datasets, producing sharper and higher-quality metric depth estimates than existing approaches.

Paper Structure

This paper contains 13 sections, 6 equations, 2 figures.

Figures (2)

  • Figure 1: We present MetricGold, a diffusion model and associated fine-tuning protocol for monocular metric depth estimation. Its core principle is to leverage the rich visual knowledge stored in modern generative image models. Our model, derived from Stable Diffusion and fine-tuned with photorealistic synthetic data, can zero-shot transfer to unseen datasets, offering sharp monocular metric depth estimation results.
  • Figure 2: Overview of the MetricGold fine-tuning protocol: Beginning with a pretrained Stable Diffusion model, we first fine-tune the VAE by applying a reconstruction loss on log-normalized metric depth. The image and depth are encoded into their latent spaces using the original Stable Diffusion VAE and the fine-tuned depth VAE, respectively. Next, the U-Net is fine-tuned by optimizing the standard diffusion objective with respect to the depth latent code. To enable image conditioning, the two latent codes are concatenated before being fed into the U-Net, and the first layer of the U-Net is modified to accept the concatenated latent inputs.
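
The log-normalized metric depth mentioned in the Figure 2 caption can be illustrated with a minimal sketch. The exact formula and depth range are not given here, so the form below is an assumption: depth is clamped to a hypothetical range [D_MIN, D_MAX], mapped linearly in log space, and rescaled to [-1, 1], the input range a Stable Diffusion VAE typically expects. The function names and bounds are illustrative only and are not taken from the paper.

```python
import math

# Hypothetical depth bounds in metres; the source does not state the range used.
D_MIN = 0.1
D_MAX = 100.0


def log_normalize(depth_m, d_min=D_MIN, d_max=D_MAX):
    """Map metric depth in metres to [-1, 1] on a log scale (assumed form)."""
    d = min(max(depth_m, d_min), d_max)  # clamp to the assumed valid range
    log_span = math.log(d_max) - math.log(d_min)
    unit = (math.log(d) - math.log(d_min)) / log_span  # position in [0, 1]
    return 2.0 * unit - 1.0


def log_denormalize(value, d_min=D_MIN, d_max=D_MAX):
    """Invert log_normalize: map a value in [-1, 1] back to metres."""
    unit = (value + 1.0) / 2.0
    log_span = math.log(d_max) - math.log(d_min)
    return math.exp(math.log(d_min) + unit * log_span)


# Equal depth ratios occupy equal intervals after normalization, so an
# indoor step (1 m to 2 m) and an outdoor step (40 m to 80 m) get the
# same share of the representable range.
indoor_gap = log_normalize(2.0) - log_normalize(1.0)
outdoor_gap = log_normalize(80.0) - log_normalize(40.0)
```

Under these assumptions a doubling of depth takes up the same interval whether it happens at 1 m or at 40 m, which is one way to read the claim that a log-depth representation unifies indoor and outdoor scales. With a linear mapping over the same range, the indoor step would be compressed to a small fraction of the outdoor one.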