Wid3R: Wide Field-of-View 3D Reconstruction via Camera Model Conditioning

Dongki Jung, Jaehoon Choi, Adil Qureshi, Somi Jeong, Dinesh Manocha, Suyong Yeon

TL;DR

Wid3R tackles multi-view 3D reconstruction for wide field-of-view cameras by introducing a feed-forward, distortion-aware framework that uses a spherical-harmonics ray-space representation and a trainable camera-model token. By representing geometry per-view in local coordinates and predicting a pencil of rays together with radial distances, the method aligns views with a single scale factor $s^*$ and avoids explicit rectification. The approach unifies pinhole, fisheye, and 360° camera inputs and demonstrates strong zero-shot pose robustness across diverse datasets, achieving state-of-the-art or competitive results in depth estimation, camera localization, and 3D point reconstruction, including challenging 360° scenes. This has practical impact for robotics, AR/VR, and large-scale mapping, enabling robust, fast 3D geometry from distorted wide-FOV imagery without extensive calibration or preprocessing.
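As a rough illustration of aligning a prediction with a single scale factor $s^*$, the sketch below computes the closed-form least-squares scale between a predicted point set and a reference point set. The least-squares objective and the function name `optimal_scale` are assumptions made for illustration; this summary does not state the objective the authors actually use.

```python
import numpy as np


def optimal_scale(pred, target):
    """Return s* = argmin_s sum ||s * pred - target||^2 (assumed objective)."""
    pred = np.asarray(pred, dtype=float).reshape(-1, 3)
    target = np.asarray(target, dtype=float).reshape(-1, 3)
    # Setting the derivative with respect to s to zero gives
    # s* = <pred, target> / <pred, pred>.
    denom = np.sum(pred * pred)
    if denom == 0.0:
        raise ValueError("predicted points are all zero; scale is undefined")
    return float(np.sum(pred * target) / denom)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred_points = rng.normal(size=(100, 3))
    target_points = 2.5 * pred_points  # reference differs by scale only
    print(optimal_scale(pred_points, target_points))
```

Because the points are expressed per view in local coordinates, one scalar is enough to resolve the global scale ambiguity; a robust (for example L1) objective would need an iterative solver instead of this closed form.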

Abstract

We present Wid3R, a feed-forward neural network for visual geometry reconstruction that supports wide field-of-view camera models. Prior methods typically assume that input images are rectified or captured with pinhole cameras, since both their architectures and training datasets are tailored to perspective images only. These assumptions limit their applicability in real-world scenarios that use fisheye or panoramic cameras and often require careful calibration and undistortion. In contrast, Wid3R is a generalizable multi-view 3D estimation method that can model wide field-of-view camera types. Our approach leverages a ray representation with spherical harmonics and a novel camera model token within the network, enabling distortion-aware 3D reconstruction. Furthermore, Wid3R is the first multi-view foundation model to support feed-forward 3D reconstruction directly from 360° imagery. It demonstrates strong zero-shot robustness and consistently outperforms prior methods, achieving improvements of up to +77.33 on Stanford2D3D.

Paper Structure (17 sections, 12 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Wid3R reconstructs wide field-of-view images in a feed-forward manner, supporting challenging camera models such as fisheye and 360$^\circ$ cameras. It demonstrates strong robustness to distortion, under which previous methods [wang2025vggt, wang2025pi] often fail.
  • Figure 2: Overview of the architecture. Multi-view wide field-of-view images are processed by DINO and aggregated through a feature aggregation module. An appropriate camera model token is selected to condition the network on camera-specific priors. Instead of directly regressing point or depth maps, Wid3R decomposes 3D reconstruction into angular and radial components, predicting ray directions and radial distances that are robust to camera projection distortion. A pose header estimates camera poses, enabling the reconstruction of global 3D points. Pretrained $\pi^{3}$ [wang2025pi] weights are loaded where applicable to accelerate training convergence, while the remaining components are trained from scratch.
  • Figure 3: Composition of camera models and training datasets. The diagram illustrates diverse camera model configurations, including pinhole, fisheye, and 360$^\circ$ cameras, along with representative scenes. Wid3R unifies these camera models within a single framework, enabling robust reconstruction under wide field-of-view settings.
  • Figure 4: Qualitative results of 3D point reconstruction with visual localization. Previous methods estimate 3D points through triangulation, whereas our method directly predicts them in a feed-forward manner. Our approach produces more complete and consistent reconstructions across large-scale Matterport3D [chang2017matterport3d] scenes.
  • Figure 5: Visualization of feed-forward 3D reconstruction results on wide field-of-view images. Our method demonstrates robust performance on fisheye images from FIORD [gunes2025fiord], Zip-NeRF [duckworth2023smerf], and ScanNet++ [yeshwanth2023scannet++], and 360$^\circ$ images from Matterport3D [chang2017matterport3d] and Stanford2D3D [armeni2017joint].
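
The angular/radial decomposition described in the Figure 2 caption can be sketched as follows: unit ray directions are multiplied by per-pixel radial distances to obtain local 3D points, which a camera pose then maps to global coordinates. This is a minimal sketch, not the authors' implementation. The equirectangular ray convention (x right, y down, z forward), the camera-to-world pose form, and the function names `equirect_rays`, `local_points`, and `to_world` are all assumptions introduced for illustration; in Wid3R the rays are predicted by the network rather than computed analytically.

```python
import numpy as np


def equirect_rays(height, width):
    """Unit ray directions of shape (H, W, 3) for an equirectangular image.

    Assumed convention: x right, y down, z forward; the image center looks
    along +z.
    """
    u = (np.arange(width) + 0.5) / width    # horizontal pixel centers in (0, 1)
    v = (np.arange(height) + 0.5) / height  # vertical pixel centers in (0, 1)
    lon = (u - 0.5) * 2.0 * np.pi           # longitude in (-pi, pi)
    lat = (0.5 - v) * np.pi                 # latitude in (-pi/2, pi/2)
    lon, lat = np.meshgrid(lon, lat)        # both of shape (H, W)
    x = np.cos(lat) * np.sin(lon)
    y = -np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)


def local_points(rays, radial):
    """Combine unit rays (H, W, 3) with radial distances (H, W)."""
    return rays * radial[..., None]


def to_world(points, rotation, translation):
    """Apply an assumed camera-to-world pose: X_world = R @ X_cam + t."""
    return points @ rotation.T + translation


if __name__ == "__main__":
    rays = equirect_rays(4, 8)
    radial = np.full((4, 8), 2.0)  # every pixel 2 units from the camera
    pts_cam = local_points(rays, radial)
    pts_world = to_world(pts_cam, np.eye(3), np.array([1.0, 0.0, 0.0]))
    print(pts_world.shape)
```

Radial distance, unlike z-depth, stays well defined for rays at or beyond 90 degrees from the optical axis, which is one reason a ray-plus-distance parameterization suits fisheye and 360$^\circ$ inputs.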