Metric-Solver: Sliding Anchored Metric Depth Estimation from a Single Image

Tao Wen, Jiepeng Wang, Yabo Chen, Shugong Xu, Chi Zhang, Xuelong Li

TL;DR

Metric-Solver tackles metric depth estimation under unknown camera intrinsics and cross-domain depth scales by introducing a sliding anchored representation that partitions depth into a scaled near-field $d_{sn}$ and a tapered far-field $d_{tf}$ around a learnable anchor depth $d_{anchor}$. The method uses a one-encoder, two-decoder architecture (a DINOv2-based encoder with DPT-head-inspired decoders) and a learnable anchor pool to adapt to scene scale, then applies depth reprojection and mask-guided fusion to produce coherent metric depth maps. Through random anchor sampling during training and anchor-guided inference, the model achieves strong cross-dataset generalization, including zero-shot performance on indoor/outdoor benchmarks, often surpassing state-of-the-art baselines. The approach enables robust monocular metric depth estimation across diverse environments, facilitating accurate 3D reconstruction and depth-aware perception in real-world applications.

Abstract

Accurate and generalizable metric depth estimation is crucial for various computer vision applications but remains challenging due to the diverse depth scales encountered in indoor and outdoor environments. In this paper, we introduce Metric-Solver, a novel sliding anchor-based metric depth estimation method that dynamically adapts to varying scene scales. Our approach leverages an anchor-based representation, where a reference depth serves as an anchor to separate and normalize the scene depth into two components: scaled near-field depth and tapered far-field depth. The anchor acts as a normalization factor, enabling the near-field depth to be normalized within a consistent range while mapping far-field depth smoothly toward zero. Through this approach, any depth from zero to infinity in the scene can be expressed within a unified representation, effectively eliminating the need to manually account for scene scale variations. More importantly, for the same scene, the anchor can slide along the depth axis, dynamically adjusting to different depth scales. A smaller anchor provides higher resolution in the near-field, improving depth precision for closer objects, while a larger anchor improves depth estimation in far regions. This adaptability enables the model to handle depth predictions at varying distances and ensures strong generalization across datasets. Our design enables a unified and adaptive depth representation across diverse environments. Extensive experiments demonstrate that Metric-Solver outperforms existing methods in both accuracy and cross-dataset generalization.
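
The anchored representation described in the abstract can be illustrated with a small sketch. The exact mapping functions are not given in this summary, so the code below assumes a simple pair: near-field depth is divided by the anchor (landing in [0, 1]), and far-field depth is mapped to anchor / depth (tending to zero as depth goes to infinity). The names `encode_depth` and `decode_depth` are hypothetical and are not the authors' implementation.

```python
import math


def encode_depth(depth_m, anchor_m):
    """Split one metric depth (meters) into (d_sn, d_tf, is_near).

    Assumed mapping, for illustration only (not taken from the paper):
      near field: d_sn = depth / anchor   -> in [0, 1] when depth <= anchor
      far field:  d_tf = anchor / depth   -> in [0, 1), tends to 0 as depth grows
    """
    if anchor_m <= 0:
        raise ValueError("anchor must be positive")
    is_near = depth_m <= anchor_m
    d_sn = depth_m / anchor_m if is_near else 1.0  # saturate beyond the anchor
    d_tf = 1.0 if is_near else anchor_m / depth_m  # math.inf maps to 0.0
    return d_sn, d_tf, is_near


def decode_depth(d_sn, d_tf, is_near, anchor_m):
    """Invert the assumed mapping back to metric depth in meters."""
    if is_near:
        return d_sn * anchor_m
    return math.inf if d_tf == 0.0 else anchor_m / d_tf


# Sliding the anchor changes how much of the normalized range a 2 m object uses.
for anchor in (5.0, 40.0):
    d_sn, d_tf, is_near = encode_depth(2.0, anchor)
    print(f"anchor={anchor:>4} m  d_sn={d_sn:.3f}  near={is_near}")

# A far point and a point at infinity both stay inside the bounded range.
print(encode_depth(20.0, 5.0))
print(encode_depth(math.inf, 5.0))
```

Under this assumed mapping, a 2 m object occupies a larger share of the normalized near-field range with a 5 m anchor than with a 40 m anchor, which is one way to read the claim that a smaller anchor gives higher near-field resolution.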

Paper Structure

This paper contains 29 sections, 15 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: A gallery of our predictions across various scenarios. The Metric-Solver model effectively addresses different in-the-wild scenes with unknown camera settings. It delivers precise metric depth predictions across a variety of scenarios, including but not limited to indoor and outdoor scenes, autonomous driving scenarios, and datasets captured by different cameras. The side bar along each depth map indicates the predicted depth range in meters.
  • Figure 2: Method Overview. Given an input image, we first employ a large-scale image encoder to extract latent features, as illustrated in (a). Next, these latent features, combined with an anchor depth sampled from the anchor pool, as shown in (b), are fed into a two-branch decoder. Here, the anchor marks the boundary between near and far, and the image is partitioned at the pixel level by the anchor mask $m_{sn}$. During training, any anchor in the pool may be randomly selected. The two-branch decoder then predicts the scaled near depth $d_{sn}$, the anchor mask $m_{sn}$, and the tapered far depth $d_{tf}$, as depicted in (c). Finally, the two depth representations are fused using the mask to generate the final complete depth prediction, as demonstrated in (d).
  • Figure 3: Qualitative comparisons of depth predictions on the indoor dataset NYU. We show both depth maps and corresponding error maps. When dealing with large-scale and long-distance indoor scenes, our framework achieves better absolute depth recovery.
  • Figure 4: Qualitative comparisons of different ablation settings. Compared with the baseline setting (c), our full setting (f) effectively observes farther distances (e.g., the sky in the second row). The anchor mask-based fusion strategy ensures seamless stitching of near and far depths (d) and higher depth fidelity in near-range indoor scenes (e).
  • Figure 5: Qualitative comparisons of different reference anchor depths. Anchors at different distances allow the near head to focus precisely on depths within different ranges and provide accurate anchor depth masks.
  • ...and 4 more figures
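
The fusion step in the Figure 2 caption, stage (d), can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the near branch is reprojected to meters as `d_sn * anchor`, the far branch as `anchor / d_tf` (an assumed inverse mapping), and the anchor mask `m_sn` is treated as a per-pixel near-field probability thresholded at 0.5. The function name `fuse_depth`, the threshold, and the `eps` guard are all hypothetical.

```python
def fuse_depth(d_sn, d_tf, m_sn, anchor_m, eps=1e-6):
    """Fuse near and far predictions into one metric depth map (meters).

    d_sn, d_tf, m_sn are same-shaped 2D lists (rows of per-pixel values).
    Assumptions, for illustration only:
      near depth  = d_sn * anchor
      far depth   = anchor / d_tf   (eps guards against division by zero)
      a pixel is treated as near-field when m_sn >= 0.5
    """
    fused = []
    for sn_row, tf_row, m_row in zip(d_sn, d_tf, m_sn):
        row = []
        for sn, tf, m in zip(sn_row, tf_row, m_row):
            near_m = sn * anchor_m             # reproject near branch to meters
            far_m = anchor_m / max(tf, eps)    # reproject far branch to meters
            row.append(near_m if m >= 0.5 else far_m)
        fused.append(row)
    return fused


# One-row toy "image": the first pixel is near, the second lies beyond the anchor.
depth = fuse_depth(
    d_sn=[[0.5, 1.0]],
    d_tf=[[1.0, 0.25]],
    m_sn=[[0.9, 0.1]],
    anchor_m=10.0,
)
print(depth)  # [[5.0, 40.0]]
```

A hard threshold keeps the sketch short. A soft blend weighted by `m_sn` would be another plausible choice, and this summary does not say which one the paper uses.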