UniDepth: Universal Monocular Metric Depth Estimation

Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, Fisher Yu

TL;DR

UniDepth addresses the challenge of universal monocular metric depth estimation by predicting per-pixel metric 3D points from a single image without relying on external camera parameters. It introduces a self-promptable dense camera module and a pseudo-spherical output space that cleanly separates camera rays from depth, aided by a geometric invariance loss to enforce consistency across geometric augmentations. Trained on a large, diverse real-world dataset and evaluated zero-shot across ten unseen datasets, UniDepth achieves state-of-the-art performance, particularly in scale-invariant metrics, and even tops the KITTI depth prediction benchmark. The work demonstrates robust 3D reconstruction across varied scenes and camera setups, with flexible test-time conditioning if additional camera information is available.

Abstract

Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepth, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE methods, UniDepth directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepth implements a self-promptable camera module predicting dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. Thorough evaluations on ten datasets in a zero-shot regime consistently demonstrate the superior performance of UniDepth, even when compared with methods directly trained on the testing domains. Code and models are available at: https://github.com/lpiccinelli-eth/unidepth
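As an illustration of the pseudo-spherical output representation, the following minimal sketch converts a pixel into an (azimuth, elevation) pair and then combines those angles with a log-depth value to recover a metric 3D point. The angle convention (azimuth measured in the x-z plane, elevation measured from that plane toward y), the pinhole camera model, and the function names `pixel_to_angles` and `angles_depth_to_point` are assumptions made for this example; they are not taken from the UniDepth code base.

```python
import math


def pixel_to_angles(u, v, fx, fy, cx, cy):
    """Map a pixel to (azimuth, elevation) in radians.

    Assumes a pinhole camera and an illustrative angle convention:
    azimuth is measured in the x-z plane, elevation from that plane toward y.
    """
    x = (u - cx) / fx  # normalized ray direction, x component (z = 1)
    y = (v - cy) / fy  # normalized ray direction, y component (z = 1)
    theta = math.atan2(x, 1.0)
    phi = math.atan2(y, math.hypot(x, 1.0))
    return theta, phi


def angles_depth_to_point(theta, phi, z_log):
    """Combine camera angles and log-depth into a metric 3D point (X, Y, Z)."""
    z = math.exp(z_log)  # depth along the optical axis
    # Unit ray direction reconstructed from the two angles.
    dx = math.cos(phi) * math.sin(theta)
    dy = math.sin(phi)
    dz = math.cos(phi) * math.cos(theta)
    scale = z / dz  # stretch the ray so that its z component equals the depth
    return dx * scale, dy * scale, z


# Hypothetical intrinsics and depth, chosen only for this example.
theta, phi = pixel_to_angles(u=400, v=300, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
point = angles_depth_to_point(theta, phi, z_log=math.log(2.0))
```

In this sketch the angles depend only on the camera and the depth value depends only on the scene, which is the separation the pseudo-spherical representation is meant to provide: an error in `z_log` moves the point along its ray without changing `theta` or `phi`.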

Paper Structure

This paper contains 22 sections, 5 equations, 7 figures, 13 tables, 1 algorithm.

Figures (7)

  • Figure 1: We introduce UniDepth, a novel approach that directly predicts 3D points in a scene from a single input image. UniDepth incorporates a camera self-prompting mechanism and leverages a pseudo-spherical 3D output space defined by azimuth angle, elevation angle, and depth ($\theta$, $\phi$, $z$). This design separates camera and depth optimization by preventing gradients caused by depth-related error ($\varepsilon_z$) from flowing to the camera module.
  • Figure 2: Model Architecture. UniDepth uses only the input image to generate the 3D output ($\mathbf{O}$). It bootstraps a dense camera prediction ($\mathbf{C}$) from the Camera Module and injects prior knowledge of scene scale into the Depth Module via a cross-attention layer. The camera representation corresponds to azimuth and elevation angles. The geometric invariance loss ($\mathcal{L}_{\mathrm{con}}$) enforces consistency between camera-conditioned depth feature tensors obtained from different geometric augmentations ($\mathcal{T}_1$, $\mathcal{T}_2$). Stop-gradient is applied to the encoded feature ($\mathbf{F}$) flowing to the Camera Module to prevent the camera gradient from dominating the depth gradient in the encoder. The depth output ($\mathbf{Z}_{\log}$) is obtained through three self-attention blocks interleaved with learnable $2\times$ upsampling. The final output is the concatenation of the camera and depth tensors ($\mathbf{C} \,\|\, \mathbf{Z}_{\log}$), creating two independent optimization spaces for $\mathcal{L}_{\lambda \mathrm{MSE}}$.
  • Figure 3: Impact of noise in camera intrinsics. The relative distortion of the intrinsics ($\varepsilon_{\mathrm{CAM}}$, in %) is shown on the x-axis, and $\delta_{0.5}$ performance on OOD test sets is shown on the y-axis. Relying on external input inherently means being subject to its noise. UniDepth functions in dual regimes, with and without external intrinsics. When intrinsics are unknown or highly noisy, UniDepth remains robust by bootstrapping its own camera prediction (Ours). In contrast, when low-noise intrinsics are available, we leverage them for enhanced peak performance (Ours-CAM).
  • Figure 4: Zero-shot qualitative results. Each pair of consecutive rows corresponds to one test sample. Each odd row shows the input RGB image and the predicted pointcloud color-coded with coolwarm based on the absolute relative error. Each even row shows GT depth and the predicted depth. The last column represents the specific colormap ranges for depth and error. (†): KITTI and NYU in the training set.
  • Figure 5: Zero-shot qualitative results. Each pair of consecutive rows corresponds to one test sample. Each odd row shows the input RGB image and the absolute relative error map color-coded with coolwarm colormap. Each even row shows GT depth and the predicted depth. The last column represents the specific colormap ranges for depth and error. (†): KITTI and NYU in the training set.
  • ...and 2 more figures
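
The captions above refer to two depth metrics: the absolute relative error and the $\delta_{0.5}$ accuracy. A minimal sketch of how such metrics are commonly computed is given below. It assumes the widespread convention that $\delta_n$ is the fraction of valid pixels with $\max(\hat{d}/d, d/\hat{d}) < 1.25^n$, and that pixels with non-positive ground truth are invalid; the function names and sample values are hypothetical and are not taken from the paper's evaluation code.

```python
def _valid_pairs(pred, gt):
    """Keep only pixels with positive prediction and positive ground truth."""
    return [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]


def abs_rel(pred, gt):
    """Mean absolute relative error over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred, gt, exponent=1.0):
    """Fraction of valid pixels whose depth ratio is below 1.25 ** exponent.

    exponent=0.5 corresponds to the stricter delta_0.5 threshold
    under the convention assumed here.
    """
    threshold = 1.25 ** exponent
    pairs = _valid_pairs(pred, gt)
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)


# Hypothetical depths in meters; the last pixel has no ground truth.
pred = [1.0, 2.2, 3.0, 5.0]
gt = [1.0, 2.0, 4.0, 0.0]

error = abs_rel(pred, gt)
accuracy = delta_accuracy(pred, gt, exponent=0.5)
```

A smaller exponent tightens the threshold, so $\delta_{0.5}$ rewards only predictions that are within roughly 12% of the ground truth.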