ScaleDepth: Decomposing Metric Depth Estimation into Scale Prediction and Relative Depth Estimation

Ruijie Zhu, Chuxin Wang, Ziyang Song, Li Liu, Tianzhu Zhang, Yongdong Zhang

TL;DR

ScaleDepth tackles monocular metric depth estimation across diverse indoor and outdoor scenes by decomposing metric depth $M$ into a scene scale $S$ and a relative depth $R$, so that $M = S \times R$. It introduces a semantic-aware scale prediction (SASP) module that predicts $S$ from semantic and structural cues, supervised through CLIP-based text-image similarity, and an adaptive relative depth estimation (ARDE) module that estimates $R$ in a normalized $[0, 1]$ depth space using bin-based, mask-guided attention. A joint loss combines a scale-invariant depth loss with a text-image similarity term. The paper reports state-of-the-art results in indoor, outdoor, unconstrained, and unseen scenarios without predefined depth ranges. Its zero-shot generalization, obtained without dataset-specific tuning, makes the approach relevant to robotics, AR/VR, and 3D reconstruction. Overall, ScaleDepth is a unified framework for monocular metric depth estimation that models scene scale explicitly and adapts the relative depth distribution to each image.
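To make the decomposition $M = S \times R$ concrete, here is a minimal pure-Python sketch. It assumes a simple bin-based head in which each pixel's relative depth is the probability-weighted average of bin centers in the normalized $[0, 1]$ space, in the style of earlier bin-based depth methods. The function names, the four fixed bin centers, and the scale values of 10 m and 80 m are illustrative assumptions, not taken from the paper, and the mask-guided attention of ARDE is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relative_depth(bin_logits, bin_centers):
    """Relative depth of one pixel as the expected bin center.

    Assumption: bin centers lie in [0, 1], so the result does too.
    """
    probs = softmax(bin_logits)
    return sum(p * c for p, c in zip(probs, bin_centers))


def metric_depth(scale, rel_depth_map):
    """Apply M = S * R per pixel; `scale` is a single scalar per image."""
    return [[scale * r for r in row] for row in rel_depth_map]


# Toy example: 4 bins whose centers cover the normalized depth space.
centers = [0.125, 0.375, 0.625, 0.875]
r_near = relative_depth([5.0, 0.0, 0.0, 0.0], centers)  # mass on nearest bin
r_far = relative_depth([0.0, 0.0, 0.0, 5.0], centers)   # mass on farthest bin
rel_map = [[r_near, r_far]]

indoor = metric_depth(10.0, rel_map)   # hypothetical indoor scale (meters)
outdoor = metric_depth(80.0, rel_map)  # hypothetical outdoor scale (meters)
```

The same relative depth map yields different metric depths once a different scene scale is applied, which is why a single relative depth head can serve both indoor and outdoor images in this formulation.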

Abstract

Estimating depth from a single image is a challenging visual task. Compared to relative depth estimation, metric depth estimation attracts more attention due to its practical physical significance and critical applications in real-life scenarios. However, existing metric depth estimation methods are typically trained on specific datasets with similar scenes, facing challenges in generalizing across scenes with significant scale variations. To address this challenge, we propose a novel monocular depth estimation method called ScaleDepth. Our method decomposes metric depth into scene scale and relative depth, and predicts them through a semantic-aware scale prediction (SASP) module and an adaptive relative depth estimation (ARDE) module, respectively. The proposed ScaleDepth enjoys several merits. First, the SASP module can implicitly combine structural and semantic features of the images to predict precise scene scales. Second, the ARDE module can adaptively estimate the relative depth distribution of each image within a normalized depth space. Third, our method achieves metric depth estimation for both indoor and outdoor scenes in a unified framework, without the need for setting the depth range or fine-tuning the model. Extensive experiments demonstrate that our method attains state-of-the-art performance across indoor, outdoor, unconstrained, and unseen scenes. Project page: https://ruijiezhu94.github.io/ScaleDepth
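The TL;DR describes a joint loss that combines a scale-invariant depth loss with a text-image similarity term. The sketch below shows one plausible shape for such a loss in pure Python. The scale-invariant term follows the standard log-space formulation; the similarity term is written as a cross-entropy over cosine similarities between a scale query and per-category text embeddings. The values `lam=0.85`, `temperature=0.07`, and `weight=1.0`, the function names, and the toy embeddings are assumptions for illustration, and the paper's exact formulation may differ.

```python
import math


def silog_loss(pred, target, lam=0.85):
    """Scale-invariant log loss over flat lists of positive depths."""
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    mean_sq = sum(x * x for x in d) / n
    mean = sum(d) / n
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(mean_sq - lam * mean * mean, 0.0))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_loss(query, text_embeddings, label, temperature=0.07):
    """Cross-entropy over cosine similarities to each category embedding."""
    logits = [cosine(query, t) / temperature for t in text_embeddings]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def joint_loss(pred, target, query, text_embeddings, label, weight=1.0):
    """Depth term plus weighted similarity term (weight is an assumption)."""
    depth_term = silog_loss(pred, target)
    sim_term = similarity_loss(query, text_embeddings, label)
    return depth_term + weight * sim_term


# Toy usage: two scene categories with orthogonal stand-in embeddings.
text_emb = [[1.0, 0.0], [0.0, 1.0]]
loss = joint_loss([2.0, 4.0], [2.0, 4.0], [1.0, 0.0], text_emb, label=0)
```

With `lam=1.0` the depth term is fully invariant to a global rescaling of the prediction; values below 1 retain some penalty on the overall scale error.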

Paper Structure (14 sections, 7 equations, 9 figures, 12 tables)

Figures (9)

  • Figure 1: Examples of various scenes and objects with different depths. Scenes of different categories typically exhibit large scale variations (a, b, and c), while scenes of the same category have similar scales (c and d). The same objects can have varying depths within the same scene because of their different placement (e).
  • Figure 2: Within a unified framework, our method ScaleDepth achieves accurate metric depth estimation for both indoor and outdoor scenes without setting depth ranges or fine-tuning models. Left: the input RGB image and corresponding depth prediction. Right: the comparison of model parameters and performance. With fewer parameters overall, our model ScaleDepth-NK significantly outperforms the state-of-the-art methods under the same experimental settings.
  • Figure 3: The overall architecture of the proposed ScaleDepth. We design bin queries to predict the relative depth distribution and scale queries to predict the scene scale. During training, we preset text prompts containing 28 scene categories as input to the frozen CLIP text encoder. We then calculate the similarity between the updated scale queries and the text embeddings, and use the scene category as auxiliary supervision. During inference, only a single image is required to obtain the relative depth and scene scale, from which a metric depth map is synthesized.
  • Figure 4: The qualitative comparison on the NYU-Depth V2 dataset. For each test sample pair, the left is the depth map and the right is the error map. In each map, blue corresponds to lower (metric depth or error) values and red to higher values.
  • Figure 6: The qualitative comparison of 3D point clouds reconstructed from the predicted metric depth on the NYU-Depth V2 dataset. Each row corresponds to a test sample. We use the same camera parameters to project the metric depth, and use the same viewpoints to visualize the point clouds. The red regions highlight where our method recovers more detailed and complete scene structure.
  • ...and 4 more figures
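
The caption of Figure 6 mentions projecting predicted metric depth into 3D point clouds using camera parameters. As a rough illustration of that step, the sketch below back-projects a depth map with the standard pinhole camera model. The function name `backproject` and the intrinsics used in the example are hypothetical; the paper's actual visualization pipeline is not described here.

```python
def backproject(depth, fx, fy, cx, cy):
    """Back-project a metric depth map into 3D camera-frame points.

    depth: 2D list indexed as depth[v][u], in meters.
    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one invalid pixel and made-up intrinsics.
toy_depth = [[2.0, 2.0],
             [0.0, 4.0]]
cloud = backproject(toy_depth, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

Because the depth is metric, the resulting points are in meters, so clouds from different methods can be compared directly from the same viewpoint.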