Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang

TL;DR

Prompt Depth Anything reframes metric depth estimation as prompting a depth foundation model with a metric LiDAR cue, producing metric depth at up to 4K resolution. A concise multi-scale prompt fusion design integrates the LiDAR signal into the DPT decoder, preserving the backbone's capabilities while supplying accurate scale information. A scalable training pipeline combines LiDAR simulation on synthetic data with pseudo GT depth generated for real data via Zip-NeRF, and an edge-aware loss bridges the data gap while preserving edges. Empirical results on ARKitScenes and ScanNet++ show state-of-the-art performance and strong zero-shot generalization, with demonstrated benefits for 3D reconstruction and robotic grasping. The work offers a practical, extensible framework for metric depth estimation in real-world scenes using widely available LiDAR prompts and adaptable depth foundation models.
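The edge-aware loss mentioned above can be illustrated with a minimal sketch. It assumes one plausible form: an L1 term against the sensor-annotated GT depth plus a weighted gradient term against the pseudo GT depth. The function names, the list-of-lists depth map representation, and the weight of 0.5 are illustrative assumptions, not the authors' implementation.

```python
def l1_loss(pred, target):
    """Mean absolute difference between two equally sized 2D depth maps."""
    h, w = len(pred), len(pred[0])
    total = sum(abs(pred[y][x] - target[y][x]) for y in range(h) for x in range(w))
    return total / (h * w)


def gradient_loss(pred, target):
    """Mean absolute difference of horizontal and vertical finite differences."""
    h, w = len(pred), len(pred[0])
    total, count = 0.0, 0
    # Horizontal neighbours.
    for y in range(h):
        for x in range(w - 1):
            grad_pred = pred[y][x + 1] - pred[y][x]
            grad_target = target[y][x + 1] - target[y][x]
            total += abs(grad_pred - grad_target)
            count += 1
    # Vertical neighbours.
    for y in range(h - 1):
        for x in range(w):
            grad_pred = pred[y + 1][x] - pred[y][x]
            grad_target = target[y + 1][x] - target[y][x]
            total += abs(grad_pred - grad_target)
            count += 1
    return total / count if count else 0.0


def edge_aware_loss(pred, sensor_gt, pseudo_gt, grad_weight=0.5):
    """Assumed form: absolute depth from sensor GT, edges from pseudo GT.

    grad_weight is a hypothetical default chosen for illustration.
    """
    return l1_loss(pred, sensor_gt) + grad_weight * gradient_loss(pred, pseudo_gt)
```

In this sketch, a pseudo GT map that differs from the prediction only by a constant offset contributes nothing to the gradient term, so it can supply edge structure without dictating absolute depth, which is left to the sensor GT term.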

Abstract

Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes LiDAR simulation for synthetic data and pseudo GT depth generation for real data. Our approach sets a new state of the art on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.

Paper Structure

This paper contains 48 sections, 2 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Illustration and capabilities of Prompt Depth Anything. (a) Prompt Depth Anything is a new paradigm for metric depth estimation, formulated as prompting a depth foundation model with a metric prompt, specifically a low-cost LiDAR. (b) Our method enables consistent depth estimation, addressing the limitations of Metric3D v2 [hu2024metric3dv2], which suffers from inaccurate scale and inconsistency. (c) It achieves accurate 4K depth estimation, significantly surpassing the resolution of ARKit LiDAR depth (240 $\times$ 320).
  • Figure 2: Overview of Prompt Depth Anything. (a) Prompt Depth Anything builds on a depth foundation model [yang2024depthv2] with a ViT encoder and a DPT decoder, and adds a multi-scale prompt fusion design in which a prompt fusion block fuses the metric information at each scale. (b) Since training requires both low-cost LiDAR and precise GT depth, we propose a scalable data pipeline that simulates LiDAR depth for synthetic data with precise GT depth, and generates pseudo GT depth for real data with LiDAR. An edge-aware depth loss is proposed to merge accurate edges from pseudo GT depth with accurate depth in textureless areas from FARO-annotated GT depth on real data.
  • Figure 3: Effects of synthetic data LiDAR simulation and real data pseudo GT generation with the edge-aware depth loss. The middle and right columns show the depth predictions of our different models. The two rows highlight the significance of sparse anchor interpolation for LiDAR simulation and of pseudo GT generation with the edge-aware depth loss, respectively.
  • Figure 4: Qualitative comparisons with the state-of-the-art. "Metric3D v2" and "Depth Any. v2" are scale-shift corrected with ARKit depth. The pink boxes denote the GT depth and depth percentage error map, where red represents high error, and blue indicates low error.
  • Figure 5: Qualitative comparisons of TSDF reconstruction. *_align denotes the scale-shift corrected depth with ARKit depth.
  • ...and 11 more figures
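
The captions of Figures 4 and 5 refer to baselines that are "scale-shift corrected with ARKit depth". A common way to do such a correction is a closed-form least-squares fit of a scale and a shift that map the predicted depth onto the valid metric samples. The sketch below shows that generic procedure on flattened depth values; the function name, the validity test (metric value greater than zero), and the choice to align in depth space rather than inverse-depth space are assumptions, not details taken from the paper.

```python
def fit_scale_shift(relative, metric):
    """Least-squares (s, t) minimizing sum((s * r + t - m) ** 2) over valid pairs.

    relative: predicted depth values (arbitrary scale), flattened.
    metric:   reference metric depth values; entries <= 0 are treated as
              invalid and skipped (an assumption for this sketch).
    """
    pairs = [(r, m) for r, m in zip(relative, metric) if m > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid samples")
    mean_r = sum(r for r, _ in pairs) / n
    mean_m = sum(m for _, m in pairs) / n
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    if var_r == 0:
        raise ValueError("relative depth is constant; scale is undetermined")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in pairs)
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


def apply_scale_shift(relative, scale, shift):
    """Map relative depth values to metric values with the fitted parameters."""
    return [scale * r + shift for r in relative]


# Usage: the last sample has no valid metric reading and is ignored by the fit.
relative = [0.0, 1.0, 2.0, 3.0, 9.0]
metric = [1.0, 3.0, 5.0, 7.0, 0.0]
scale, shift = fit_scale_shift(relative, metric)
aligned = apply_scale_shift(relative, scale, shift)
```

A global two-parameter fit of this kind corrects only the overall scale and offset of a prediction, so any spatially varying error in the relative depth remains after alignment.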