Wid3R: Wide Field-of-View 3D Reconstruction via Camera Model Conditioning
Dongki Jung, Jaehoon Choi, Adil Qureshi, Somi Jeong, Dinesh Manocha, Suyong Yeon
TL;DR
Wid3R tackles multi-view 3D reconstruction for wide field-of-view cameras by introducing a feed-forward, distortion-aware framework that uses a spherical-harmonics ray-space representation and a trainable camera-model token. By representing geometry per-view in local coordinates and predicting a pencil of rays together with radial distances, the method aligns views with a single scale factor $s^*$ and avoids explicit rectification. The approach unifies pinhole, fisheye, and 360° camera inputs and demonstrates strong zero-shot robustness across diverse datasets, achieving state-of-the-art or competitive results in depth estimation, camera localization, and 3D point reconstruction, including challenging 360° scenes. This matters in practice for robotics, AR/VR, and large-scale mapping, where it enables fast, robust 3D geometry estimation from distorted wide-FOV imagery without extensive calibration or preprocessing.
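To make the ray-plus-radial-distance idea concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names `points_from_rays` and `optimal_scale` are hypothetical, and the closed-form least-squares estimate of $s^*$ is an assumption, since this summary does not state how the scale factor is obtained.

```python
import numpy as np

def points_from_rays(rays, radial_dist):
    """Local 3D points: unit ray direction times radial distance.

    Radial distance (not z-depth) stays well defined for rays that
    point sideways or backwards, as in fisheye and 360-degree images.
    """
    rays = rays / np.linalg.norm(rays, axis=-1, keepdims=True)
    return rays * radial_dist[..., None]

def optimal_scale(pred, target):
    """Assumed closed form: s* = argmin_s ||s * pred - target||^2."""
    return float(np.sum(pred * target) / np.sum(pred * pred))

# Toy example: three rays (not yet normalized) and their radial distances.
rays = np.array([[0.0, 0.0, 2.0], [3.0, 0.0, 0.0], [0.0, -1.0, 0.0]])
dist = np.array([1.0, 2.0, 0.5])
pred = points_from_rays(rays, dist)

# A reference point cloud that differs from the prediction only by scale.
target = 2.5 * pred
s_star = optimal_scale(pred, target)
```

In this toy case the reference differs from the prediction by a pure scale, so a single scalar is enough to align the two point sets.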
Abstract
We present Wid3R, a feed-forward neural network for visual geometry reconstruction that supports wide field-of-view camera models. Prior methods typically assume that input images are rectified or captured with pinhole cameras, since both their architectures and training datasets are tailored to perspective images only. These assumptions limit their applicability in real-world scenarios that use fisheye or panoramic cameras and often require careful calibration and undistortion. In contrast, Wid3R is a generalizable multi-view 3D estimation method that can model wide field-of-view camera types. Our approach leverages a ray representation with spherical harmonics and a novel camera model token within the network, enabling distortion-aware 3D reconstruction. Furthermore, Wid3R is the first multi-view foundation model to support feed-forward 3D reconstruction directly from 360° imagery. It demonstrates strong zero-shot robustness and consistently outperforms prior methods, achieving improvements of up to +77.33 on Stanford2D3D.
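As an illustration of why a per-pixel ray representation suits 360° input, the sketch below builds unit rays for an equirectangular panorama. The helper name `equirect_rays` and the axis convention (x right, y down, z forward) are assumptions made for this example only; the abstract does not specify the paper's parametrization or its spherical-harmonics encoding.

```python
import numpy as np

def equirect_rays(height, width):
    """Unit ray per pixel center of an equirectangular panorama.

    Assumed convention: x right, y down, z forward; the image center
    looks along +z and the top row looks up (negative y).
    """
    u = (np.arange(width) + 0.5) / width    # horizontal position in (0, 1)
    v = (np.arange(height) + 0.5) / height  # vertical position in (0, 1)
    lon = (u - 0.5) * 2.0 * np.pi           # longitude in (-pi, pi)
    lat = (0.5 - v) * np.pi                 # latitude in (-pi/2, pi/2)
    lon, lat = np.meshgrid(lon, lat)        # both shaped (height, width)
    x = np.cos(lat) * np.sin(lon)
    y = -np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)     # shaped (height, width, 3)

rays = equirect_rays(4, 8)
behind = rays[..., 2] < 0  # pixels whose rays point behind the camera
```

Half of the rays have negative z, which a pinhole projection cannot represent. That is one reason perspective-only pipelines need rectification or cropping before they can process such images.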
