UniDAC: Universal Metric Depth Estimation for Any Camera

Girish Chandar Ganesan, Yuliang Guo, Liu Ren, Xiaoming Liu

Abstract

Monocular metric depth estimation (MMDE) is a core challenge in computer vision, playing a pivotal role in real-world applications that demand accurate spatial understanding. Although prior works have shown promising zero-shot performance in MMDE, they often struggle to generalize across diverse camera types, such as fisheye and $360^\circ$ cameras. Recent advances have addressed this through unified camera representations or canonical representation spaces, but they require either including large-FoV camera data during training or separately trained models for different domains. We propose UniDAC, an MMDE framework that exhibits universal robustness across all domains and generalizes across diverse cameras using a single model. We achieve this by decoupling metric depth estimation into relative depth prediction and spatially varying scale estimation, enabling robust performance across different domains. We propose a lightweight Depth-Guided Scale Estimation module that upsamples a coarse scale map to high resolution using the relative depth map as guidance to account for local scale variations. Furthermore, we introduce RoPE-$\phi$, a distortion-aware positional embedding that respects the spatial warping in Equi-Rectangular Projections (ERP) via latitude-aware weighting. UniDAC achieves state-of-the-art (SoTA) performance in cross-camera generalization, consistently outperforming prior methods across all datasets.
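To make the decoupling concrete, the contrast sketched in Figure 2 can be written as follows (our notation; the element-wise combination is our reading of the abstract and Figure 2, with the paper's exact rule given in its \eqref{eq:final_metric_d}):

$$
\text{median scaling: } \mathbf{D}^\text{m} = s\,\mathbf{D}^\text{rel},\ s \in \mathbb{R}_{>0},
\qquad
\text{depth-guided scaling: } \mathbf{D}^\text{m} = \mathbf{S} \odot \mathbf{D}^\text{rel},\ \mathbf{S} \in \mathbb{R}^{H \times W}_{>0},
$$

where $\odot$ denotes element-wise multiplication, so each pixel receives its own scale rather than one global scalar $s$.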

Figures (9)

  • Figure 1: We propose UniDAC, a universal, domain-agnostic metric depth estimation framework that generalizes to any camera. Unlike prior methods that either rely on large-FoV data during training or require separate models for indoor and outdoor domains, UniDAC is trained solely on perspective images yet generalizes effectively to large-FoV inputs, leveraging a universal model to robustly handle both indoor and outdoor environments.
  • Figure 2: We show the Abs.Rel error between the predicted relative depth and the ground truth under (a) no scaling, (b) median scaling, and (c) depth-guided scaling. In theory, relative and metric depth are related by a single scale $s$; in practice, (b) shows that irregularities in the relative depth cannot be compensated for by a single scalar $s$. To tackle this, (c) we propose a Depth-Guided Scale Estimation module that predicts a high-resolution scale map $\mathbf{S}$ that respects local variations.
  • Figure 3: Overview of the proposed method. UniDAC decouples metric depth estimation into relative depth and scale estimation. Relative depth relies on local scene information, while scene scale is domain-specific and depends on global scene information. Therefore, given an ERP image $\mathbf{I}$, we split the encoder features into local features $\mathbf{F}_l$ and global features $\mathbf{F}_g$. We predict the relative depth $\mathbf{D}^\text{rel}$ from the local features $\mathbf{F}_l$, and a scale map $\mathbf{S}$ from the global features $\mathbf{F}_g$ to account for the irregularities in $\mathbf{D}^\text{rel}$. We first predict a low-resolution scale map $\mathbf{S}_r$ and obtain the high-resolution $\mathbf{S}$ through our proposed Depth-Guided Scale (DGS) estimation module. The DGS module upsamples $\mathbf{S}_r$ using $\mathbf{D}^\text{rel}$ as a guide so that the upsampling respects object boundaries. The final metric depth $\mathbf{D}^\text{m}$ is computed from $\mathbf{D}^\text{rel}$ and $\mathbf{S}$ as shown in \eqref{eq:final_metric_d}. We introduce a distortion-aware positional embedding, termed RoPE-$\phi$, that applies a weight $w(\phi)$ to the RoPE rotations based on the latitude $\phi$. We train with two losses, $\mathcal{L}_\text{rel}$ and $\mathcal{L}_\text{m}$, applied to $\mathbf{D}^\text{rel}$ and $\mathbf{D}^\text{m}$, respectively.
  • Figure 4: Depth-Guided Upsampling. We leverage the predicted relative depth $\mathbf{D}^\text{rel}$ as a guide to upsample the predicted low-resolution scale map $\mathbf{S}_r \in \mathbb{R}^{\frac{H}{r}\times\frac{W}{r}}$ to $\mathbf{S} \in \mathbb{R}^{H\times W}$. We compare $\mathbf{D}^\text{rel}$ with its downsampled version $\mathbf{D}_r$ to extract local structure in the form of weights $\mathbf{W} \in \mathbb{R}^{H\times W\times 9}$, and combine $\mathbf{W}$ with the spatial mapping between the high-resolution grid and $\mathbf{S}_r$ to obtain $\mathbf{S}$. The Depth-Guided Upsampling is non-parametric and thus adds no learnable parameters (a code sketch follows this figure list).
  • Figure 5: Motivation for RoPE-$\phi$. We show the difference between (a) the pixel distance in the ERP and (b) the corresponding geodesic distance on the surface of the sphere. Although $|\mathbf{p}_{11}-\mathbf{p}_{12}| = |\mathbf{p}_{21}-\mathbf{p}_{22}|$ in the ERP, $\mathcal{G}(\mathbf{p}_{11},\mathbf{p}_{12}) < \mathcal{G}(\mathbf{p}_{21},\mathbf{p}_{22})$ on the sphere. The geodesic distance reflects the actual separation in 3D space. Thus, we modify 2D-RoPE to reflect the geodesic distance, yielding RoPE-$\phi$ (see the sketch after this figure list).
  • ...and 4 more figures
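As a reading aid for the Figure 4 caption, here is a minimal PyTorch sketch of a non-parametric Depth-Guided Upsampling with the stated interface: a coarse scale map $\mathbf{S}_r$, the guide $\mathbf{D}^\text{rel}$, and weights over a $3\times 3$ coarse neighborhood. The softmax affinity kernel and the temperature `tau` are our assumptions, not the paper's confirmed weighting rule.

```python
import torch
import torch.nn.functional as F


def depth_guided_upsample(scale_lr: torch.Tensor, depth_rel: torch.Tensor,
                          r: int, tau: float = 0.1) -> torch.Tensor:
    """Non-parametric depth-guided upsampling (sketch).

    scale_lr:  (B, 1, H/r, W/r) coarse scale map S_r
    depth_rel: (B, 1, H, W)     relative depth D^rel used as guidance
    Returns:   (B, 1, H, W)     high-resolution scale map S
    """
    B, _, H, W = depth_rel.shape
    h, w = H // r, W // r

    # Downsample the guide to the coarse resolution (D_r in Fig. 4).
    depth_lr = F.interpolate(depth_rel, size=(h, w), mode="bilinear",
                             align_corners=False)

    # For every coarse cell, gather its 3x3 neighborhood of depth and scale
    # values; unfold returns (B, 9, h*w) for a single-channel input.
    d_nbr = F.unfold(depth_lr, kernel_size=3, padding=1).view(B, 9, h, w)
    s_nbr = F.unfold(scale_lr, kernel_size=3, padding=1).view(B, 9, h, w)

    # Nearest-neighbor upsampling so each fine pixel sees the 9 coarse
    # candidates of the cell it falls into (the spatial mapping in Fig. 4).
    d_nbr = F.interpolate(d_nbr, size=(H, W), mode="nearest")
    s_nbr = F.interpolate(s_nbr, size=(H, W), mode="nearest")

    # Guidance affinity: each fine pixel favors coarse neighbors whose depth
    # is close to its own; softmax kernel and tau are our assumptions.
    weights = (-(depth_rel - d_nbr).abs() / tau).softmax(dim=1)  # W in Fig. 4

    # Convex combination of coarse scale candidates -> high-res scale map S.
    return (weights * s_nbr).sum(dim=1, keepdim=True)
```

Because every operation here is interpolation, unfolding, or a softmax over guidance affinities, the module carries no learnable parameters, consistent with the caption's non-parametric claim.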
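Similarly, a sketch of how latitude-aware weighting could enter 2D-RoPE, following the geodesic argument of Figure 5: a horizontal step of $\Delta x$ ERP pixels at latitude $\phi$ spans a geodesic arc proportional to $\cos\phi \cdot \Delta x$, while vertical steps are latitude-independent. The specific weight $w(\phi)=\cos\phi$ and the function names below are our assumptions; the paper defines its own $w(\phi)$.

```python
import torch


def rope_phi_angles(x_pos: torch.Tensor, y_pos: torch.Tensor,
                    phi: torch.Tensor, freqs: torch.Tensor):
    """Latitude-weighted rotary angles for ERP tokens (sketch).

    x_pos, y_pos: (N,) token coordinates on the ERP grid
    phi:          (N,) token latitude in radians, in [-pi/2, pi/2]
    freqs:        (K,) base rotary frequencies (K = head_dim // 4 in 2D-RoPE)
    Returns:      theta_x, theta_y, each (N, K)
    """
    # Damp the horizontal rotation angles by the geodesic factor cos(phi);
    # w(phi) = cos(phi) is our guess at the paper's latitude-aware weight.
    w = torch.cos(phi)                                # (N,)
    theta_x = (w * x_pos)[:, None] * freqs[None, :]   # horizontal angles
    theta_y = y_pos[:, None] * freqs[None, :]         # vertical angles
    return theta_x, theta_y
```

The weighted angles would then be applied with the standard rotary construction, e.g. $q' = q\cos\theta + \operatorname{rotate\_half}(q)\sin\theta$, so tokens near the poles accumulate smaller horizontal phase differences, matching their smaller geodesic separation.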