Cameras as Rays: Pose Estimation via Ray Diffusion
Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, Shubham Tulsiani
TL;DR
This work reframes sparse-view camera pose estimation by representing each camera as a bundle of rays, parameterized with Plücker coordinates and associated with image patches, which enables patch-level, set-to-set inference with transformers. It introduces two models: a deterministic Ray Regression model that predicts ray bundles, and a probabilistic Denoising Ray Diffusion model that samples them and so captures multi-modal pose uncertainty. The approach achieves state-of-the-art results on CO3D and generalizes to unseen categories and in-the-wild captures. It also demonstrates that ray-based representations can yield finer localization and remain robust where regression of global camera extrinsics struggles. Combining a distributed ray parametrization with diffusion-based uncertainty modeling offers a principled path toward reliable sparse-view camera pose estimation in realistic settings.
Abstract
Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparsely sampled views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. This representation allows for a tight coupling with spatial image features, improving pose precision. We observe that this representation is naturally suited for set-level transformers and develop a regression-based approach that maps image patches to corresponding rays. To capture the inherent uncertainties in sparse-view pose inference, we adapt this approach to learn a denoising diffusion model, which allows us to sample plausible modes while improving performance. Our proposed methods, both regression- and diffusion-based, demonstrate state-of-the-art performance on camera pose estimation on CO3D while generalizing to unseen object categories and in-the-wild captures.
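
To make the "camera as a bundle of rays" idea concrete, the following is a minimal NumPy sketch, not the authors' implementation, of converting a single pinhole camera into one Plücker ray per image patch. It assumes a world-to-camera convention x_cam = R x_world + t, an intrinsics matrix K, and the common Plücker convention in which a ray is stored as a unit direction d and a moment m = c × d, with c the camera center. The function name `camera_to_plucker_rays`, the `grid` parameter, and the example values are hypothetical and chosen only for illustration.

```python
import numpy as np


def camera_to_plucker_rays(K, R, t, image_size, grid=(16, 16)):
    """Return an (N, 6) array of Plücker rays [d, m], one per patch center.

    Assumptions (illustrative, not taken from the paper's code):
      - world-to-camera extrinsics: x_cam = R @ x_world + t
      - image_size is (width, height) in pixels
      - grid is (rows, cols) of patches; N = rows * cols
    """
    width, height = image_size
    rows, cols = grid

    # Pixel coordinates of the patch centers, in homogeneous form.
    us = (np.arange(cols) + 0.5) * width / cols
    vs = (np.arange(rows) + 0.5) * height / rows
    uu, vv = np.meshgrid(us, vs)
    pixels = np.stack([uu.ravel(), vv.ravel(), np.ones(rows * cols)], axis=0)

    # Unproject pixels to camera-frame directions, then rotate into world frame.
    dirs_cam = np.linalg.inv(K) @ pixels            # shape (3, N)
    dirs_world = (R.T @ dirs_cam).T                 # shape (N, 3)
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=1, keepdims=True)

    # Every ray passes through the camera center c = -R^T t.
    center = -R.T @ t
    moments = np.cross(center, dirs_world)          # m = c x d, shape (N, 3)

    return np.concatenate([dirs_world, moments], axis=1)


# Example with made-up camera parameters.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
angle = np.deg2rad(30.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.1, -0.2, 2.0])

rays = camera_to_plucker_rays(K, R, t, image_size=(640, 480), grid=(4, 4))
print(rays.shape)
```

Because each ray is tied to a specific patch, a set-level model can in principle predict the six ray coordinates per patch rather than a single global rotation and translation per image. The moment of each ray is orthogonal to its direction by construction, which gives a simple sanity check on any predicted bundle.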
