SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation

Duc-Hai Pham, Tung Do, Phong Nguyen, Binh-Son Hua, Khoi Nguyen, Rang Nguyen

TL;DR

SharpDepth is a novel approach to monocular metric depth estimation that combines the metric accuracy of discriminative depth estimation methods with the fine-grained boundary sharpness typically achieved by generative methods, yielding depth predictions that are both metrically precise and visually sharp.

Abstract

We propose SharpDepth, a novel approach to monocular metric depth estimation that combines the metric accuracy of discriminative depth estimation methods (e.g., Metric3D, UniDepth) with the fine-grained boundary sharpness typically achieved by generative methods (e.g., Marigold, Lotus). Traditional discriminative models trained on real-world data with sparse ground-truth depth can accurately predict metric depth but often produce over-smoothed or low-detail depth maps. Generative models, in contrast, are trained on synthetic data with dense ground truth; they generate depth maps with sharp boundaries but provide only relative depth with low accuracy. Our approach bridges these limitations by integrating metric accuracy with detailed boundary preservation, resulting in depth predictions that are both metrically precise and visually sharp. Our extensive zero-shot evaluations on standard depth estimation benchmarks confirm SharpDepth's effectiveness, showing that it achieves both high depth accuracy and detailed representation, which makes it well-suited for applications requiring high-quality depth perception across diverse, real-world environments.

Paper Structure

This paper contains 16 sections, 7 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: We present SharpDepth, a diffusion-based depth model for refining metric depth estimators, e.g., UniDepth piccinelli2024unidepth, without relying on ground-truth depth data. Our method can recover sharp details in thin structures and improve overall point cloud quality.
  • Figure 2: The performance of SOTA depth estimation models in terms of depth accuracy (x-axis) on KITTI baruch2021arkitscenes and DBE Completion (y-axis) on Sintel sintel, UnrealStereo4K unrealstereo and Spring mehl2023spring. Our method (SharpDepth) achieves the best balance across both axes.
  • Figure 3: Our framework utilizes a diffusion-based estimator and a metric depth estimator to generate affine-invariant and metric depth maps, respectively. A Noise-Aware Gating mechanism produces a selectively noisy latent map, which is fed into our SharpDepth model. The training pipeline uses Score Distillation Sampling and Noise-Aware Reconstruction Losses to refine accuracy and enhance details.
  • Figure 4: The difference map between the UniDepth and Lotus predictions. The high-difference (brighter) areas are heavily distorted by noise, whereas in the low-difference (darker) areas, some information about the wheel is still recognizable.
  • Figure 5: Zero-shot qualitative results on unseen test samples of the KITTI geiger2013vision and DIODE vasiljevic2019diode datasets. Our method strikes a balance between depth accuracy and details. UniDepth lacks several details, while UniDepth-aligned Lotus is less accurate.
  • ...and 12 more figures
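
The captions for Figures 4 and 5 mention a difference map between a metric prediction (UniDepth) and an affine-invariant one (Lotus), as well as a "UniDepth-aligned Lotus" baseline. The exact procedure is not given in this excerpt, so the following is only a minimal sketch under stated assumptions: the affine-invariant map is aligned to the metric map with a least-squares scale and shift (a common choice, assumed here), and the difference map is the absolute difference normalized to [0, 1]. The function names `align_scale_shift` and `difference_map` are hypothetical and are not taken from the paper.

```python
import numpy as np


def align_scale_shift(relative, metric):
    """Fit scale s and shift t by least squares so that s * relative + t ~ metric.

    Assumption: a global affine alignment; the paper may use a different scheme.
    """
    x = relative.ravel()
    y = metric.ravel()
    # Design matrix [x, 1] for the linear model y = s * x + t.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * relative + t


def difference_map(metric, relative, eps=1e-8):
    """Absolute difference between metric depth and aligned relative depth,
    normalized by its maximum (normalization is an illustrative choice)."""
    aligned = align_scale_shift(relative, metric)
    diff = np.abs(metric - aligned)
    return diff / (diff.max() + eps)


# Toy usage: the two maps agree up to scale and shift except at one pixel.
relative = np.linspace(1.0, 10.0, 64).reshape(8, 8)
metric = 2.0 * relative + 1.0
metric[3, 4] += 5.0  # simulate a region where the two predictions disagree
dmap = difference_map(metric, relative)
```

In this toy case the perturbed pixel has the largest value in `dmap`, which mirrors the caption's reading of brighter areas as regions where the two estimators disagree most.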