UniK3D: Universal Camera Monocular 3D Estimation

Luigi Piccinelli, Christos Sakaridis, Mattia Segu, Yung-Hsu Yang, Siyuan Li, Wim Abbeloos, Luc Van Gool

TL;DR

UniK3D tackles the challenge of monocular metric 3D estimation across arbitrary camera geometries by introducing a universal, camera-agnostic framework. It adopts a fully spherical output space with radial depth and models the pencil of rays through a spherical-harmonics basis, enabling generalization from pinhole to panoramic cameras without test-time intrinsics. Key contributions include the SH-based camera module, a radial decoder with conditioning strategies, and an asymmetric angular loss to prevent contraction, all validated by zero-shot results on 13 diverse datasets that show strong performance in challenging wide FoV scenarios. The work significantly broadens the applicability of monocular 3D reconstruction to real-world, distortion-heavy imaging, with code and models released for reproducibility.
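To make the spherical output space concrete, the following minimal sketch converts a radial distance and two ray angles into Cartesian coordinates. The angle convention (polar angle measured from the optical axis, azimuth measured in the image plane) and the function name `spherical_to_cartesian` are assumptions made for illustration, not taken from the released code.

```python
import math

def spherical_to_cartesian(radius, polar, azimuth):
    """Convert (radial distance, polar angle, azimuth) to (x, y, z).

    Assumed convention: polar angle measured from the optical axis (+z),
    azimuth measured in the image plane from +x. Angles are in radians.
    """
    x = radius * math.sin(polar) * math.cos(azimuth)
    y = radius * math.sin(polar) * math.sin(azimuth)
    z = radius * math.cos(polar)
    return x, y, z

# A ray 30 degrees off-axis lies in front of the camera (z > 0).
front = spherical_to_cartesian(2.0, math.radians(30.0), 0.0)
# A ray 120 degrees off-axis lies behind the image plane (z < 0).
behind = spherical_to_cartesian(2.0, math.radians(120.0), 0.0)
```

Under this convention a point with polar angle above 90 degrees has negative z, so it cannot be stored as a positive depth along the optical axis, whereas its radial distance stays well defined.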

Abstract

Monocular 3D estimation is crucial for visual perception. However, current methods fall short by relying on oversimplified assumptions, such as pinhole camera models or rectified images. These limitations severely restrict their general applicability, causing poor performance in real-world scenarios with fisheye or panoramic images and resulting in substantial context loss. To address this, we present UniK3D, the first generalizable method for monocular 3D estimation able to model any camera. Our method introduces a spherical 3D representation which allows for better disentanglement of camera and scene geometry and enables accurate metric 3D reconstruction for unconstrained camera models. Our camera component features a novel, model-independent representation of the pencil of rays, achieved through a learned superposition of spherical harmonics. We also introduce an angular loss, which, together with the camera module design, prevents the contraction of the 3D outputs for wide-view cameras. A comprehensive zero-shot evaluation on 13 diverse datasets demonstrates the state-of-the-art performance of UniK3D across 3D, depth, and camera metrics, with substantial gains in challenging large-field-of-view and panoramic settings, while maintaining top accuracy in conventional pinhole small-field-of-view domains. Code and models are available at github.com/lpiccinelli-eth/unik3d.
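The abstract does not state the form of the angular loss. As a rough illustration of how an asymmetric penalty can discourage contraction, the sketch below uses a generic pinball (quantile-regression) loss that charges more for under-estimated angles than for over-estimated ones. The function name `asymmetric_l1` and the weight `alpha=0.7` are hypothetical choices for this example and should not be read as the paper's definition of $\mathcal{L}_{\text{AA}}$.

```python
def asymmetric_l1(pred, target, alpha=0.7):
    """Pinball-style asymmetric L1 loss over two equal-length sequences.

    Under-estimates (pred < target) are weighted by alpha and
    over-estimates by (1 - alpha). alpha = 0.7 is an illustrative value.
    """
    total = 0.0
    for p, t in zip(pred, target):
        diff = t - p  # positive when the prediction is too small
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(pred)

# Same absolute error of 0.1 rad, different sign.
under = asymmetric_l1([0.9], [1.0])  # predicted angle too small
over = asymmetric_l1([1.1], [1.0])   # predicted angle too large
```

With `alpha` above 0.5, predicting a polar angle that is too small costs more than overshooting by the same amount, which biases the predicted field of view away from shrinking.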

Paper Structure

This paper contains 22 sections, 3 equations, 7 figures, 30 tables.

Figures (7)

  • Figure 1: UniK3D introduces a novel and versatile approach that delivers precise metric 3D geometry estimation from a single image and for any camera type, ranging from pinhole to panoramic, without requiring any camera information. By leveraging (i) a flexible and general spherical formulation both for the radial dimension of 3D space and for the two camera-model-dependent orientation dimensions and (ii) advanced conditioning strategies, UniK3D outperforms traditional models without needing camera calibration or domain-specific tuning.
  • Figure 2: Model architecture. UniK3D uses only the single input image to generate the 3D output point cloud ($\mathbf{O}$) for any camera. The projective geometry of the camera is predicted by the Angular Module. The camera representation corresponds to azimuth and polar angles ($\mathbf{C}$) of the backprojected pencil of rays on the unit sphere $\mathbb{S}^3$. The class tokens from the Encoder are processed by 2 Transformer Encoder (T-Enc) layers to obtain the 15 coefficients ($\mathbf{H}$) of the inverse spherical transform $\mathcal{F}^{-1}_{\mathcal{B}}\{\mathbf{H}\}$ defined by a finite basis ($\mathcal{B}$) of spherical harmonics up to degree 3 with no constant component. Stop-gradient is applied to the angular information which conditions the Radial Module, simulating external information flow. The "static encoding" refers to sinusoidal encoding which matches the internal feature dimensionality. The Radial Module is composed of Transformer Decoder (T-Dec) blocks, one for each input resolution, which are used to condition the Encoder features on the bootstrapped camera representation. This conditioning injects prior knowledge of scene scale and projective geometry. The radial output ($\mathbf{R}_{\log}$) is obtained by processing the camera-aware features via a learnable upsampling module. The final output is the concatenation of the camera and radial tensors ($\mathbf{C} || \mathbf{R}_{\log}$). A closed-form coordinate transform is applied to obtain the Cartesian 3D output, but supervision is applied directly on angular coordinates, with our asymmetric angular loss $\mathcal{L}_{\text{AA}}$, and on radial coordinates.
  • Figure 3: Qualitative comparisons. Each pair of consecutive rows represents one test sample. Each odd row displays the input RGB image and the 2D error map, color-coded with the coolwarm colormap based on absolute relative error (for panoramic images, the error is computed on distance rather than depth). To ensure a fair comparison, errors are calculated on GT-based shifted and scaled outputs for all models. Each even row shows the ground truth and predictions of the 3D point cloud. The last column displays the specific colormap ranges for absolute relative error. Key observations for each row pair: (1) competing methods are limited to positive depth and heavily distort the scenes for larger FoV; (2) in the case of representable but large FoV (180$^\circ$), UniK3D output is the only one that does not suffer from pronounced FoV contraction; (3) for moderate-FoV images with strong boundary distortion, e.g. fisheye, UniK3D can maintain planarity and overall scene structure; (4) our approach also delivers accurate 3D estimates for standard pinhole images.
  • Figure 4: FoV effects. The image on the left showcases the challenge of representing the full 180$^\circ$ FoV, alongside the GT point cloud. FoV contraction occurs when no "guarding", i.e. asymmetric loss ($\mathcal{L}_{\text{AA}}$) and camera conditioning, is enforced, as shown in a). The total absence of any prior may lead to impossible and inconsistent backprojection, as shown in b). The final UniK3D is depicted in c), clearly showing the ability to recover large FoVs with a sensible camera backprojection model.
  • Figure 5: Qualitative comparisons. Each pair of consecutive rows represents one test sample. Each odd row displays the input RGB image and the 2D error map, color-coded with the coolwarm colormap based on absolute relative error, with blue corresponding to 0% error and red to 25%. To ensure a fair comparison, errors are calculated on GT-based shifted and scaled outputs for all models. Each even row shows the ground truth and predictions of the 3D point cloud. All samples are randomly selected, not cherry-picked. †: GT-camera unprojection.
  • ...and 2 more figures
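
The count of 15 coefficients in the Figure 2 caption follows from the structure of the spherical-harmonics basis: degree $l$ contributes $2l + 1$ functions, so degrees 1 to 3 give $3 + 5 + 7 = 15$ once the constant degree-0 term is dropped. The short sketch below enumerates the corresponding $(l, m)$ index pairs; the helper name `sh_index_pairs` is hypothetical and the snippet only counts basis functions, it does not evaluate them.

```python
def sh_index_pairs(max_degree, include_constant=False):
    """List the (degree l, order m) pairs of a spherical-harmonics basis.

    Each degree l has orders m = -l, ..., l, i.e. 2l + 1 functions.
    Setting include_constant=False skips the single degree-0 term.
    """
    start = 0 if include_constant else 1
    return [(l, m)
            for l in range(start, max_degree + 1)
            for m in range(-l, l + 1)]

pairs = sh_index_pairs(3)
print(len(pairs))  # 15
```

Including the constant term would give $(3 + 1)^2 = 16$ functions, which is one more than the number of coefficients the caption reports.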