CAM3R: Camera-Agnostic Model for 3D Reconstruction

Namitha Guruprasad, Abhay Yadav, Cheng Peng, Rama Chellappa

Abstract

Recovering dense 3D geometry from unposed images remains a foundational challenge in computer vision. Current state-of-the-art models are predominantly trained on perspective datasets, which implicitly constrains them to standard pinhole camera geometry. As a result, their geometric accuracy degrades significantly on wide-angle imagery captured with non-rectilinear optics, such as fisheye or panoramic sensors. To address this, we present CAM3R, a Camera-Agnostic, feed-forward Model for 3D Reconstruction that processes images from wide-angle camera models without prior calibration. Our framework consists of a two-view network split into a Ray Module (RM), which estimates per-pixel ray directions, and a Cross-view Module (CVM), which infers radial distances with confidence maps, pointmaps, and relative poses. To unify these pairwise predictions into a consistent 3D scene, we introduce a Ray-Aware Global Alignment framework that refines poses and optimizes scale while strictly preserving the predicted local geometry. Extensive experiments on datasets spanning panoramic, fisheye, and pinhole imagery demonstrate that CAM3R establishes a new state of the art in pose estimation and reconstruction.

Paper Structure (34 sections, 18 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: CAM3R provides a robust, feed-forward 3D reconstruction for two-view and multi-view scenarios across disparate optical manifolds, including pinhole, fisheye and panoramic cameras, where recent 3D foundation models fail. Above, we highlight CAM3R's performance on unseen scenes from the 360Loc dataset, visualizing both raw two-view predictions and multi-view reconstructions.
  • Figure 2: CAM3R Overview. Given an input image pair $(I_1, I_2)$, the framework operates through two parallel streams. The Shared Ray Module recovers the internal camera geometry by regressing Spherical Harmonic coefficients to reconstruct continuous ray directional fields $\mathbf{d}_i$. Simultaneously, the Cross-view Module extracts features and utilizes a dual-block transformer decoder to facilitate information exchange between the two views. Specialized DPT heads then regress radial distances $\mathbf{r}_i$ with confidence maps $\sigma_i$, while a Relative Pose Network estimates the rigid transformation $P_{2 \to 1}$. The local pointmaps $\mathbf{X}^{i,i}$ are generated by fusing rays $\mathbf{d}_i$ with radial distances $\mathbf{r}_i$. Finally, the second view is transformed into the reference coordinate frame of the first view via $P_{2 \to 1}$ to produce the globally aligned 3D reconstruction.
  • Figure 3: Qualitative Two-View Reconstructions. Visualization of 3D point clouds for image pairs across diverse optical manifolds (panorama, fisheye, pinhole). Despite extreme radial distortions and varied camera geometries, relative poses are accurately recovered and structural consistency is maintained. Note that these are the raw outputs of the network.
  • Figure 4: Qualitative Pruning Analysis. From left to right: (1) Successful rejection of a non-overlapping pair; (2) A valid pair with dense 3D correspondences; (3) Rejection of a doppelganger case where visually similar computer monitors yield inconsistent relative geometry.
  • Figure 5: Qualitative Multi-View Reconstructions. Global camera trajectories and dense point clouds recovered from unstructured image pools across diverse datasets. Despite high radial distortion and lack of scenegraph information, globally consistent poses and structural geometry are maintained through the Ray-Aware Global Alignment, effectively mitigating trajectory drift and scale ambiguity.
  • ...and 5 more figures
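
The last two steps in the Figure 2 caption (fusing rays $\mathbf{d}_i$ with radial distances $\mathbf{r}_i$ into local pointmaps, then mapping the second view into the first view's frame via $P_{2 \to 1}$) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names are hypothetical, and the sketch assumes that each pixel's local point is its unit ray scaled by its radial distance and that $P_{2 \to 1}$ is given as a 3×3 rotation `R` plus a translation `t`.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def local_point(ray, radial_distance):
    """Fuse one ray direction with its radial distance: X = r * d (assumed form)."""
    d = normalize(ray)
    return tuple(radial_distance * c for c in d)

def apply_pose(R, t, p):
    """Apply a rigid transform, p' = R p + t, to a single 3D point."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

def to_reference_frame(rays, distances, R, t):
    """Build view-2 local points and express them in the view-1 frame."""
    return [apply_pose(R, t, local_point(d, r)) for d, r in zip(rays, distances)]

# Toy example: two pixels of view 2, with a 90-degree rotation about the y-axis
# and a unit translation along z standing in for P_{2->1} (illustrative values).
rays_2 = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
distances_2 = [2.0, 1.0]
R_21 = [[0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0],
        [-1.0, 0.0, 0.0]]
t_21 = (0.0, 0.0, 1.0)

points_in_view1 = to_reference_frame(rays_2, distances_2, R_21, t_21)
```

Because the parameterization is a ray direction times a radial distance rather than a depth along a fixed optical axis, the same construction applies to pinhole, fisheye, and panoramic pixels; only the ray field differs between camera models.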