Cameras as Rays: Pose Estimation via Ray Diffusion

Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, Shubham Tulsiani

TL;DR

This work reframes camera pose estimation from a sparse set of views by representing each camera as a bundle of rays (Plücker coordinates) associated with image patches, enabling patch-level, set-to-set inference via transformers. It introduces a deterministic Ray Regression model and a probabilistic Denoising Ray Diffusion model to predict and sample ray bundles, respectively, thereby capturing multi-modal pose uncertainties. The approach achieves state-of-the-art results on CO3D, generalizes to unseen categories and in-the-wild captures, and demonstrates that ray-based representations can yield finer localization and robust performance where traditional global extrinsic regressions struggle. The combination of distributed ray parametrization with diffusion-based uncertainty modeling offers a principled pathway for reliable sparse-view camera pose estimation in realistic settings.

Abstract

Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparsely sampled views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. This representation allows for a tight coupling with spatial image features, improving pose precision. We observe that this representation is naturally suited for set-level transformers and develop a regression-based approach that maps image patches to corresponding rays. To capture the inherent uncertainties in sparse-view pose inference, we adapt this approach to learn a denoising diffusion model, which allows us to sample plausible modes while improving performance. Our proposed methods, both regression- and diffusion-based, demonstrate state-of-the-art performance on camera pose estimation on CO3D while generalizing to unseen object categories and in-the-wild captures.

Paper Structure (16 sections, 12 equations, 12 figures, 15 tables)

Figures (12)

  • Figure 1: Recovering Sparse-view Camera Parameters by Denoising Rays. Top: Given sparsely sampled images, our approach learns to denoise camera rays (represented using Plücker coordinates). We then recover camera intrinsics and extrinsics from the positions of the rays. Bottom: We demonstrate the generalization of our approach for both seen (teddybear) and unseen object categories (couch, sandwich).
  • Figure 2: Converting Between Camera and Ray Representations. We represent cameras as a collection of 6-D Plücker rays consisting of directions and moments. We convert the traditional representation of cameras to the ray bundle representation by unprojecting rays from the camera center to pixel coordinates. We convert rays back to the traditional camera representation by solving least-squares optimizations for the camera center, intrinsics matrix, and rotation matrix. See the ray representation section for more details.
  • Figure 3: Denoising Ray Diffuser Network. Given a noisy ray corresponding to an image patch, our denoising ray diffusion model predicts the denoised ray. We concatenate spatial image features (DINOv2; Oquab et al., 2023) with noisy rays, represented with 6-dimensional Plücker coordinates (Plücker, 1828) that are visualized as 3-channel direction maps and 3-channel moment maps. We use a transformer to jointly process all image patches and associated noisy rays to predict the original denoised rays.
  • Figure 4: Visualizing the Denoising Process Using Our Ray Diffuser. Given the 2 images of the suitcase (bottom right), we visualize the denoising process starting from randomly initialized camera rays. We visualize the noisy rays using the Plücker representation (ray directions and moments) in the bottom row and their corresponding 3D positions in the top row. In the rightmost column, we recover the predicted cameras (green) and compare them to the ground truth cameras (black).
  • Figure 5: Qualitative Comparison Between Predicted Camera Poses. We compare the results of our regression and diffusion approaches with PoseDiffusion and RelPose++. Ground truth (black) camera trajectories are aligned to the predicted (colored) camera trajectories by performing Procrustes optimal alignment on the camera centers. The top two examples are from seen categories, and the bottom two are from held out categories.
  • ...and 7 more figures
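
The camera-to-ray conversion described in the Figure 2 caption can be illustrated with a short NumPy sketch. This is not the authors' implementation. The function names (`camera_to_plucker_rays`, `center_from_rays`, `skew`), the world-to-camera convention x_cam = R x_world + t, and the unit-normalized directions are assumptions made for illustration. Only the camera-center step of the inverse conversion is shown; the least-squares recovery of the intrinsics and rotation matrices mentioned in the caption is omitted.

```python
import numpy as np

def camera_to_plucker_rays(K, R, t, pixels):
    """Unproject pixel coordinates to 6-D Plücker rays (direction, moment).

    Assumed convention (not stated in the source): x_cam = R @ x_world + t.
    """
    center = -R.T @ t                                      # camera center in the world frame
    uv1 = np.hstack([pixels, np.ones((len(pixels), 1))])   # homogeneous pixels, shape (N, 3)
    dirs = (R.T @ np.linalg.inv(K) @ uv1.T).T              # world-frame ray directions
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)    # normalize to unit length
    moments = np.cross(center, dirs)                       # moment m = c x d
    return np.hstack([dirs, moments])                      # shape (N, 6)

def skew(v):
    """Cross-product matrix [v]_x, so that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def center_from_rays(rays):
    """Least-squares camera center: minimize sum_i ||c x d_i - m_i||^2."""
    dirs, moments = rays[:, :3], rays[:, 3:]
    # c x d = -[d]_x c, so each ray contributes the linear system -[d_i]_x c = m_i.
    A = np.concatenate([-skew(d) for d in dirs])           # shape (3N, 3)
    b = moments.reshape(-1)                                # shape (3N,)
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    return c
```

With noise-free rays from a single camera, `center_from_rays` should return the same center that `camera_to_plucker_rays` used, provided at least two rays have non-parallel directions. With predicted (noisy) rays, it returns the point that best satisfies all moment constraints in the least-squares sense.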