PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment

Jianyuan Wang, Christian Rupprecht, David Novotny

TL;DR

PoseDiffusion reframes camera pose estimation as sampling from a learned conditional distribution $p(x|\mathtt{I})$ using a diffusion model, enabling joint intrinsic and extrinsic estimation for arbitrary image sets. The method integrates a Transformer-based denoiser with diffusion steps and geometry-guided sampling via Sampson epipolar constraints to iteratively refine camera parameters. It achieves state-of-the-art results on CO3Dv2 and RealEstate10k and shows strong generalization across datasets, including cross-domain transfer to RealEstate10k. The approach also improves downstream novel-view synthesis when used to provide camera parameters for NeRF training, highlighting its practical impact for 3D reconstruction and view synthesis workflows.
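The Sampson epipolar error mentioned above is a standard first-order approximation of the geometric distance between a pair of matched points and the epipolar constraint. The NumPy sketch below is only an illustration of that quantity, not the authors' implementation. The function name `sampson_error` and the convention $x_2^\top F x_1 = 0$ are assumptions made for this example.

```python
import numpy as np

def sampson_error(F, x1, x2):
    """Sampson (first-order) epipolar error for N point matches.

    F  : (3, 3) fundamental matrix, assumed convention x2^T F x1 = 0.
    x1 : (N, 2) points in the first image.
    x2 : (N, 2) matched points in the second image.
    Returns an (N,) array of squared Sampson errors.
    """
    ones = np.ones((len(x1), 1))
    x1h = np.hstack([x1, ones])          # homogeneous coordinates
    x2h = np.hstack([x2, ones])
    Fx1 = x1h @ F.T                      # row i is F @ x1h[i]
    Ftx2 = x2h @ F                       # row i is F.T @ x2h[i]
    num = np.sum(x2h * Fx1, axis=1) ** 2 # (x2^T F x1)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return num / den

# Toy setup (hypothetical): identity intrinsics, no rotation, and a unit
# translation along x, so F equals the skew-symmetric matrix of t = (1, 0, 0).
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
x1 = np.array([[0.2, 0.1], [-0.3, 0.4]])
x2 = x1 + np.array([0.5, 0.0])           # exact matches for points at depth 2
print(sampson_error(F, x1, x2))          # zero for perfect matches
print(sampson_error(F, x1, x2 + np.array([0.0, 0.1])))  # about 0.005 each
```

Because the error is differentiable in `F`, and `F` is determined by the camera intrinsics and extrinsics, its gradient with respect to the camera parameters can serve as a guidance signal during sampling.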

Abstract

Camera pose estimation is a long-standing computer vision problem that to date often relies on classical methods, such as handcrafted keypoint matching, RANSAC and bundle adjustment. In this paper, we propose to formulate the Structure from Motion (SfM) problem inside a probabilistic diffusion framework, modelling the conditional distribution of camera poses given input images. This novel view of an old problem has several advantages. (i) The nature of the diffusion framework mirrors the iterative procedure of bundle adjustment. (ii) The formulation allows a seamless integration of geometric constraints from epipolar geometry. (iii) It excels in typically difficult scenarios such as sparse views with wide baselines. (iv) The method can predict intrinsics and extrinsics for an arbitrary amount of images. We demonstrate that our method PoseDiffusion significantly improves over the classic SfM pipelines and the learned approaches on two real-world datasets. Finally, it is observed that our method can generalize across datasets without further training. Project page: https://posediffusion.github.io/

Paper Structure (48 sections, 10 equations, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Camera Pose Estimation with PoseDiffusion. We present a method to predict the camera parameters (extrinsics and intrinsics) for a given collection of scene images. Our model combines the strengths of traditional epipolar constraints from point correspondences with the power of diffusion models to iteratively refine an initially random set of poses.
  • Figure 2: PoseDiffusion overview. Training is supervised given a multi-view dataset of images and camera poses to learn a diffusion model $D_\theta$ to model $p(x|\mathtt{I})$. During inference the reverse diffusion process is guided through the gradient that minimizes the Sampson Epipolar Error between image pairs, optimizing geometric consistency between poses.
  • Figure 3: Inference. For each step $t$, Geometry-Guided Sampling (GGS) samples the distribution $p_\theta(x_{t-1} \mid x_t, \mathtt{I}, t)$ of refined cameras $x_{t-1}$ conditioned on input images $\mathtt{I}$ and the previous estimate $x_t$, while being guided by the gradient of the Sampson matching density $p(\mathtt{I}|x)$.
  • Figure 4: Pose estimation on CO3Dv2. Estimated cameras given input images $\mathtt{I}$ (first row). Our PoseDiffusion (2nd row) is compared to RelPose (3rd row), COLMAP+SPSG (4th row), and the ground truth. Missing cameras indicate failure.
  • Figure 5: Pose estimation accuracy on CO3Dv2. Metrics $\mathrm{RRA}@\tau,\mathrm{RTA}@\tau$ at different thresholds $\tau$ and $\mathrm{mAA}(30)$ ($y$-axes, higher-better) as a function of the number of input frames ($x$-axes). RelPose does not predict camera translation and hence is omitted in the respective figures.
  • ...and 6 more figures
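
The caption of Figure 5 reports $\mathrm{RRA}@\tau$ without restating its definition. The sketch below assumes the common pairwise definition: for every pair of cameras, compare the predicted relative rotation with the ground-truth relative rotation and report the fraction of pairs whose angular error is below $\tau$ degrees. The function names and the world-to-camera rotation convention are assumptions for illustration, not taken from the paper's code.

```python
import numpy as np

def rotation_angle_deg(R_a, R_b):
    """Angle in degrees of the rotation that maps R_b onto R_a."""
    R_rel = R_a @ R_b.T
    cos = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def rra_at_tau(R_pred, R_gt, tau_deg):
    """Fraction of camera pairs with relative rotation error below tau_deg.

    R_pred, R_gt : sequences of (3, 3) rotation matrices, one per camera.
    """
    n = len(R_pred)
    errors = []
    for i in range(n):
        for j in range(i + 1, n):
            rel_pred = R_pred[j] @ R_pred[i].T   # predicted rotation i -> j
            rel_gt = R_gt[j] @ R_gt[i].T         # ground-truth rotation i -> j
            errors.append(rotation_angle_deg(rel_pred, rel_gt))
    return float(np.mean(np.array(errors) < tau_deg))

def rot_z(deg):
    """Rotation about the z-axis, used only to build the toy example."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

# Toy example: three cameras, the first prediction is off by 20 degrees.
R_gt = [rot_z(0), rot_z(40), rot_z(90)]
R_pred = [R_gt[0] @ rot_z(20), R_gt[1], R_gt[2]]
print(rra_at_tau(R_pred, R_gt, 15))  # one of three pairs is within 15 degrees
print(rra_at_tau(R_pred, R_gt, 30))  # all three pairs are within 30 degrees
```

Since only relative rotations are compared, the score is unchanged when all predicted cameras differ from the ground truth by a single global rotation of the world frame.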