DepthMaster: Taming Diffusion Models for Monocular Depth Estimation

Ziyang Song, Zerong Wang, Bo Li, Hao Zhang, Ruijie Zhu, Li Liu, Peng-Tao Jiang, Tianzhu Zhang

TL;DR

DepthMaster addresses the trade-off between inference speed and generalization in monocular depth estimation by adapting diffusion priors to a single-step, deterministic framework. It introduces a Feature Alignment module, which injects semantic information from a high-quality external encoder, and a Fourier Enhancement module, which recovers fine details. The two modules are trained in two stages: the first learns global scene structure, and the second refines detail. The authors report state-of-the-art zero-shot generalization and detail preservation among diffusion-based methods across multiple datasets, with faster inference than iterative diffusion methods. The work shows that adapting generative features in a targeted way can bridge generative priors and discriminative depth estimation.

Abstract

Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network's representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets. Our project page can be found at https://indu1ge.github.io/DepthMaster_page.
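
To make the idea of balancing low-frequency structure against high-frequency detail concrete, the following is a minimal sketch, not the paper's Fourier Enhancement module. It assumes a single 2D map, a hard radial cutoff, and fixed scalar weights standing in for whatever adaptive weighting the module applies. The function name `fourier_balance` and the parameters `cutoff`, `w_low`, and `w_high` are hypothetical.

```python
import numpy as np


def fourier_balance(feat, cutoff=0.25, w_low=1.0, w_high=1.0):
    """Reweight the low- and high-frequency content of a 2D map.

    Illustrative only: fixed scalar weights and a hard radial mask are
    assumptions, not the learned module described in the abstract.
    """
    h, w = feat.shape
    # Centered 2D spectrum (zero frequency in the middle).
    spec = np.fft.fftshift(np.fft.fft2(feat))

    # Normalized frequency coordinates in [-0.5, 0.5) for each axis.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    # cutoff is a fraction of the Nyquist frequency (0.5 cycles/sample).
    low_mask = radius <= cutoff * 0.5
    weights = np.where(low_mask, w_low, w_high)

    # Back to the spatial domain; the imaginary part is numerical noise
    # because the mask is radially symmetric.
    out = np.fft.ifft2(np.fft.ifftshift(spec * weights))
    return out.real
```

With `w_low=1.0` and `w_high=0.0` the function acts as a low-pass filter that keeps only coarse structure, while `w_high > 1.0` exaggerates edges and fine detail. An adaptive variant would predict these weights from the input rather than fixing them.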

Paper Structure (17 sections, 10 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Visualization of different paradigms. "Denoise" refers to predicting depth in a diffusion-denoising way. Limited by the feature representation capability of the denoising network, predictions tend to overfit texture details and miss the real structure, as highlighted with yellow boxes in Column 3. "Stage1" alleviates this issue with the Feature Alignment module, but suffers from blurry outputs due to removing the iterative process, as highlighted with red boxes in Column 4. "Stage2" presents the final model fine-tuned with the Fourier Enhancement module, which exhibits excellent generalization and fine-grained details.
  • Figure 2: The overall framework of DepthMaster. RGB is first projected into the latent space by the I2L Encoder to obtain $z_{RGB}$. Next, the U-Net converts RGB latent to depth prediction latent $z_{pred}$, which is decoded back to the depth map by the I2L Decoder. The Feature Alignment module is applied in the first stage to align the representation of the U-Net to that of the high-quality external encoder, introducing semantic information into the diffusion model. In the second stage, the Fourier Enhancement module adaptively balances low-frequency structure and high-frequency details to enhance the visual quality.
  • Figure 3: Qualitative comparison with zero-shot monocular depth estimation methods across different datasets. Our model demonstrates excellent detail preservation and structure capture capabilities. Benefiting from the Feature Alignment module, our model avoids overfitting to textures.
  • Figure 4: Qualitative results on in-the-wild examples. Our model not only recovers correct scene structure, but also exhibits fine-grained details.
  • Figure 5: Depth distributions produced by different depth preprocessing methods on Virtual KITTI. Square-root disparity exhibits the most uniform distribution.
  • ...and 1 more figure
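
The square-root disparity preprocessing named in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions: disparity is taken as the reciprocal of metric depth, and the result is min-max normalized to [0, 1] using clipping bounds `d_min` and `d_max`. The function name, the bounds, and the normalization are assumptions for this example; the paper's exact preprocessing is not given in this excerpt.

```python
import numpy as np


def sqrt_disparity(depth, d_min, d_max):
    """Map metric depth to normalized square-root disparity in [0, 1].

    Assumed convention (not confirmed by the source): 1.0 corresponds to
    the nearest depth d_min and 0.0 to the farthest depth d_max.
    """
    depth = np.clip(np.asarray(depth, dtype=np.float64), d_min, d_max)
    s = np.sqrt(1.0 / depth)        # square root of disparity
    s_near = np.sqrt(1.0 / d_min)   # largest value, at the near bound
    s_far = np.sqrt(1.0 / d_max)    # smallest value, at the far bound
    return (s - s_far) / (s_near - s_far)


# Example: depths of 1 m, 4 m, and 100 m with bounds [1 m, 100 m].
values = sqrt_disparity([1.0, 4.0, 100.0], d_min=1.0, d_max=100.0)
```

Compared with raw depth, this mapping compresses far ranges and spreads out near ranges, though less aggressively than plain disparity, which is one plausible reason for the more uniform distribution the caption describes.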