MetricGold: Leveraging Text-To-Image Latent Diffusion Models for Metric Depth Estimation
Ansh Shah, K Madhava Krishna
TL;DR
This work tackles monocular metric depth estimation by exploiting the rich priors of pretrained latent diffusion models. MetricGold reframes depth prediction as conditional denoising in a latent diffusion framework built on a Stable Diffusion backbone, incorporating a Depth VAE and a conditioned Denoising U-Net, and trains on photo-realistic synthetic data to achieve strong zero-shot generalization. Key contributions include (i) latent-space depth modeling with image-conditioned latents, (ii) log-depth representation to unify indoor and outdoor scales, and (iii) an efficient training protocol that completes on a single RTX 3090 in about two days. The approach yields sharper, more accurate metric depth estimates across diverse datasets, highlighting the practicality of diffusion priors for scalable, cross-domain depth estimation without real-world depth supervision.
Abstract
Recovering metric depth from a single image remains a fundamental challenge in computer vision, requiring both scene understanding and accurate scaling. While deep learning has advanced monocular depth estimation, current models often struggle with unfamiliar scenes and layouts, particularly in zero-shot scenarios and when predicting scale-ergodic metric depth. We present MetricGold, a novel approach that harnesses the rich priors of generative diffusion models to improve metric depth estimation. Building upon recent advances in MariGold, DDVM, and Depth Anything V2, our method combines latent diffusion, log-scaled metric depth representation, and synthetic data training, respectively. MetricGold trains efficiently on a single RTX 3090 within two days using photo-realistic synthetic data from HyperSIM, VirtualKitti, and TartanAir. Our experiments demonstrate robust generalization across diverse datasets, producing sharper and higher-quality metric depth estimates than existing approaches.
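To illustrate why a log-scaled depth representation can cover indoor and outdoor ranges with a single target, the following is a minimal sketch of one possible encoding. It is not the paper's implementation: the function names `encode_log_depth` and `decode_log_depth`, the depth bounds `D_MIN` and `D_MAX`, and the choice of a [-1, 1] target range are all assumptions made for illustration.

```python
import math

# Hypothetical depth bounds in metres (assumed; not taken from the paper).
D_MIN = 0.1
D_MAX = 100.0


def encode_log_depth(depth_m, d_min=D_MIN, d_max=D_MAX):
    """Map a metric depth in metres to [-1, 1] using log scaling."""
    d = min(max(depth_m, d_min), d_max)  # clamp to the assumed range
    t = math.log(d / d_min) / math.log(d_max / d_min)  # position in [0, 1]
    return 2.0 * t - 1.0


def decode_log_depth(code, d_min=D_MIN, d_max=D_MAX):
    """Invert encode_log_depth: map a value in [-1, 1] back to metres."""
    t = (code + 1.0) / 2.0
    return d_min * (d_max / d_min) ** t


if __name__ == "__main__":
    # Each decade of depth occupies an equal share of the encoded range,
    # so near (indoor) and far (outdoor) depths get comparable resolution.
    for depth in (0.5, 5.0, 50.0):
        code = encode_log_depth(depth)
        print(depth, round(code, 4), round(decode_log_depth(code), 4))
```

Under these assumed bounds, each factor of ten in depth spans the same interval of the encoded value, which is the property that lets one representation serve both room-scale and street-scale scenes.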
