Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera

Yuliang Guo; Sparsh Garg; S. Mahdi H. Miangoleh; Xinyu Huang; Liu Ren

TL;DR

DAC addresses zero-shot metric depth estimation across diverse camera FoVs by training exclusively on perspective data and translating all inputs into a unified ERP space. The method introduces a pitch-aware Image-to-ERP conversion, FoV alignment, and multi-resolution training to simulate large-FoV observations and align heterogeneous FoVs for robust generalization. Empirical results show SoTA zero-shot performance on large FoV datasets, with up to 50% improvement in $δ_1$ on indoor fisheye and 360° data, and strong cross-camera adaptability across backbone architectures. By enabling the reuse of existing 3D datasets from various camera types, DAC facilitates scalable and practical metric depth estimation for real-world applications.

Abstract

While recent depth foundation models exhibit strong zero-shot generalization, achieving accurate metric depth across diverse camera types, particularly those with large fields of view (FoV) such as fisheye and 360-degree cameras, remains a significant challenge. This paper presents Depth Any Camera (DAC), a powerful zero-shot metric depth estimation framework that extends a perspective-trained model to effectively handle cameras with varying FoVs. The framework is designed to ensure that all existing 3D data can be leveraged, regardless of the specific camera types used in new applications. Remarkably, DAC is trained exclusively on perspective images but generalizes seamlessly to fisheye and 360-degree cameras without the need for specialized training data. DAC employs Equi-Rectangular Projection (ERP) as a unified image representation, enabling consistent processing of images with diverse FoVs. Its core components include pitch-aware Image-to-ERP conversion with efficient online augmentation to simulate distorted ERP patches from undistorted inputs, FoV alignment operations to enable effective training across a wide range of FoVs, and multi-resolution data augmentation to further address resolution disparities between training and testing. DAC achieves state-of-the-art zero-shot metric depth estimation, improving $δ_1$ accuracy by up to 50% on multiple fisheye and 360-degree datasets compared to prior metric depth foundation models, demonstrating robust generalization across camera types.
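As a rough illustration of the Image-to-ERP conversion mentioned above, the sketch below computes, for each sample of an ERP patch, the location in a perspective image that it should be read from, using the standard gnomonic projection. It is a minimal sketch rather than the authors' implementation: the function names are hypothetical, the input camera is assumed to be an ideal pinhole with intrinsics `fx, fy, cx, cy` and no lens distortion, the patch is assumed to be centered at longitude 0, and `center_lat` stands in for the patch center latitude that the paper derives from the camera pitch.

```python
import math


def erp_to_pinhole_pixel(lon, lat, center_lat, fx, fy, cx, cy):
    """Map an ERP direction (radians) to pinhole pixel coordinates.

    Uses the gnomonic projection with tangent point (lon=0, lat=center_lat).
    Returns None when the direction lies behind the image plane.
    Assumption: ideal pinhole camera, image v axis pointing down.
    """
    cos_c = (math.sin(center_lat) * math.sin(lat)
             + math.cos(center_lat) * math.cos(lat) * math.cos(lon))
    if cos_c <= 0.0:
        return None
    # Normalized tangent-plane coordinates (x to the right, y up).
    x = math.cos(lat) * math.sin(lon) / cos_c
    y = (math.cos(center_lat) * math.sin(lat)
         - math.sin(center_lat) * math.cos(lat) * math.cos(lon)) / cos_c
    return cx + fx * x, cy - fy * y


def erp_patch_sampling_grid(height, width, lat_span, lon_span,
                            center_lat, fx, fy, cx, cy):
    """Build a height x width grid of source pixel locations for an ERP patch.

    lat_span and lon_span are the angular extents of the patch in radians.
    Samples are taken at cell centers; row 0 is the highest latitude.
    """
    grid = []
    for i in range(height):
        lat = center_lat + lat_span * (0.5 - (i + 0.5) / height)
        row = []
        for j in range(width):
            lon = lon_span * ((j + 0.5) / width - 0.5)
            row.append(erp_to_pinhole_pixel(lon, lat, center_lat,
                                            fx, fy, cx, cy))
        grid.append(row)
    return grid
```

In practice the resulting grid would be handed to a bilinear sampler to produce the ERP patch; varying `center_lat` moves the same perspective image to a different latitude band of the ERP space, which is where the stronger distortions appear.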

Paper Structure (24 sections, 14 equations, 10 figures, 8 tables)

Introduction
Related Works
Zero-Shot Monocular Depth Estimation
Depth Estimation from Large FoV Cameras
Notations and Preliminaries
Depth Any Camera
Pitch-Aware Image-to-ERP Conversion
FoV Alignment
Multi-Resolution Training
Experiments
Experimental Setup
Comparison with the SoTA
Ablation Study
Conclusion
Supplemental Experiments
...and 9 more sections

Figures (10)

Figure 1: We introduce the Depth Any Camera (DAC) framework, which leverages large-scale datasets of perspective camera images to train a single depth estimation model capable of zero-shot metric depth estimation on images captured by various types of cameras, including large-FoV fisheye and $360^\circ$ cameras.
Figure 2: Challenges of zero-shot testing on large-FoV camera images. Metric depth estimation models trained on perspective images (e.g., Metric3Dv2) experience significant performance degradation when applied to fisheye images. Degradation is less pronounced when using an undistorted portion with a highly limited FoV or its ERP conversion.
Figure 3: Depth Scaling in Canonical Model Conversion and Image Resizing. The apparent 2D size of an object $u$ in an image depends on its 3D dimensions $X$, depth $Z$, and camera focal length $f_x$. Left: Converting a perspective camera model to a canonical model with a different focal length $\hat{f}_x$ requires scaling the depth values proportionally, so $\hat{Z} = \frac{\hat{f}_x Z}{f_x}$. Center: The original camera setup, showing the direct relationship between object size, depth, and focal length. Right: When the camera model is fixed but the image is resized to $u'$, this simulates viewing the same 3D object at a different distance, necessitating depth scaling for accurate metric depth, with $Z' = \frac{u Z}{u'}$.
Figure 4: Depth Any Camera Pipeline. Our DAC framework converts data from any camera type into a canonical ERP space, enabling a model trained solely on perspective images to process large-FoV test data consistently for metric depth estimation. During training, we introduce an effective pitch-aware Image-to-ERP conversion with online data augmentation to simulate high-distortion regions unique to large-FoV images. The proposed FoV-Align process normalizes diverse-FoV data to a predefined ERP patch size, maximizing training efficiency. During inference, images from any camera type are converted into ERP space for depth estimation, with an optional step to map the ERP output back to the original image space for visualization.
Figure 5: Pitch-Aware ERP Conversion and FoV Alignment. Top: Grid sampling is applied for an efficient Image-to-ERP conversion. Each ERP grid sample's corresponding location in the input image is computed using gnomonic geometry and the specific camera projection parameters. The patch center latitude $\lambda$, determined by the camera's pitch angle, allows the converted patch to represent high-distortion regions in the ERP space. Bottom: The FoV-Align process normalizes diverse-FoV ERP patches (shown in red and green) to match the height of a single predefined ERP patch (outlined in blue), ensuring efficient training.
...and 5 more figures
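
The two depth-scaling relations in the Figure 3 caption, $\hat{Z} = \frac{\hat{f}_x Z}{f_x}$ and $Z' = \frac{u Z}{u'}$, can be written as a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and resizing is expressed through a single scale factor $s = u'/u$, so that $Z' = Z / s$.

```python
def depth_to_canonical(depth, fx, fx_canonical):
    """Scale metric depth when converting to a canonical focal length.

    Implements Z_hat = fx_canonical * Z / fx from the Figure 3 caption.
    """
    return depth * fx_canonical / fx


def canonical_to_metric(depth_canonical, fx, fx_canonical):
    """Invert depth_to_canonical to recover depth for the original camera."""
    return depth_canonical * fx / fx_canonical


def depth_after_resize(depth, resize_scale):
    """Scale depth when the image is resized under a fixed camera model.

    resize_scale is s = u' / u (assumed uniform), so Z' = u * Z / u' = Z / s.
    """
    return depth / resize_scale


# An object 2 m away seen with fx = 500 px corresponds to 4 m
# in a canonical camera with fx = 1000 px.
z_hat = depth_to_canonical(2.0, fx=500.0, fx_canonical=1000.0)

# Upsampling the image by 2x makes the object look twice as large,
# which the fixed camera model interprets as half the distance.
z_resized = depth_after_resize(z_hat, resize_scale=2.0)
```

Both operations preserve the apparent object size implied by the camera model, which is why the depth label has to change whenever the focal length or the image resolution does.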
