Extreme Two-View Geometry From Object Poses with Diffusion Models

Yujing Sun, Caiyi Sun, Yuan Liu, Yuexin Ma, Siu Ming Yiu

TL;DR

The paper tackles estimating the relative camera pose $[\mathbf{R}^{(12)};\mathbf{t}^{(12)}]$ between two views under extreme viewpoint changes by leveraging object priors from a diffusion model. It reformulates relative camera pose estimation as object pose estimation in a canonical object coordinate system, generates novel views of the object with Zero123, and matches the second image against this set of views, followed by a refinement step, to recover the two-view pose. Key components include object-centric view construction via a homography $\mathbf{H}=\mathbf{K}_v\mathbf{R}_v\mathbf{K}^{-1}$, canonical object coordinates $\mathbf{x}_o$, and a two-stage pose estimation consisting of viewpoint selection and a 3D feature-volume refiner. Experiments on the synthetic GSO dataset and the real Navi dataset show strong generalization and that the method outperforms the baselines, with demonstrated benefits for visual odometry.

Abstract

Humans have an incredible ability to effortlessly perceive the viewpoint difference between two images containing the same object, even when the viewpoint change is astonishingly vast with no co-visible regions in the images. This remarkable skill, however, has proven to be a challenge for existing camera pose estimation methods, which often fail when faced with large viewpoint differences due to the lack of overlapping local features for matching. In this paper, we aim to effectively harness the power of object priors to accurately determine two-view geometry in the face of extreme viewpoint changes. In our method, we first mathematically transform the relative camera pose estimation problem into an object pose estimation problem. Then, to estimate the object pose, we utilize the object priors learned by the diffusion model Zero123 to synthesize novel-view images of the object. The novel-view images are matched to determine the object pose and thus the two-view camera pose. In experiments, our method has demonstrated extraordinary robustness and resilience to large viewpoint changes, consistently estimating two-view poses with exceptional generalization ability across both synthetic and real-world datasets. Code will be available at https://github.com/scy639/Extreme-Two-View-Geometry-From-Object-Poses-with-Diffusion-Models.
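
The reduction from relative camera pose to object pose mentioned in the abstract can be illustrated with a short sketch. It assumes the common convention that each view's object pose maps canonical object coordinates to camera coordinates, $\mathbf{x}_c = \mathbf{R}\,\mathbf{x}_o + \mathbf{t}$; this convention, the function name `relative_pose_from_object_poses`, and the toy poses are assumptions made for illustration and are not taken from the paper. Under that convention, $\mathbf{R}^{(12)} = \mathbf{R}_2\mathbf{R}_1^\top$ and $\mathbf{t}^{(12)} = \mathbf{t}_2 - \mathbf{R}^{(12)}\mathbf{t}_1$.

```python
import numpy as np

def relative_pose_from_object_poses(R1, t1, R2, t2):
    """Compose two object poses (x_c = R x_o + t) into a view-1 -> view-2 pose.

    Hypothetical helper for illustration; the convention is an assumption.
    """
    R12 = R2 @ R1.T          # rotation taking camera-1 coords to camera-2 coords
    t12 = t2 - R12 @ t1      # matching translation
    return R12, t12

def rot_y(deg):
    """Rotation about the Y axis by `deg` degrees."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

# Toy example: the object sits 2 units in front of both cameras,
# but the second camera sees it from the opposite side (170 degrees apart).
R1, t1 = rot_y(0.0), np.array([0.0, 0.0, 2.0])
R2, t2 = rot_y(170.0), np.array([0.0, 0.0, 2.0])
R12, t12 = relative_pose_from_object_poses(R1, t1, R2, t2)

# Consistency check on an arbitrary object point.
x_o = np.array([0.1, -0.2, 0.3])
x_c1 = R1 @ x_o + t1
x_c2 = R2 @ x_o + t2
consistent = np.allclose(R12 @ x_c1 + t12, x_c2)

# Rotation angle of the relative pose, in degrees.
angle = np.degrees(np.arccos(np.clip((np.trace(R12) - 1.0) / 2.0, -1.0, 1.0)))
```

The point of the sketch is that no image-to-image correspondences are needed once both object poses are known, which is why the approach can in principle handle views with no overlap.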

Paper Structure (21 sections, 6 equations, 14 figures, 6 tables)


Figures (14)

  • Figure 1: Our method can accurately infer the extreme relative camera pose of two images containing a co-visible object, even without any overlapping regions for correspondence estimation. Our method uses a diffusion generative model to hallucinate the unseen sides of the object and matches the hallucinated images with query images to estimate relative camera poses. The estimated extreme camera poses can be used in downstream applications, e.g. visual odometry.
  • Figure 2: Challenges in applying the object prior from diffusion models, e.g. Zero123, to relative pose estimation. On the one hand, input images may not look at the object, while Zero123 and common object pose estimators require the object to be located at the image center. On the other hand, Zero123 implicitly defines a canonical object coordinate system, which makes it difficult to align the input images to this implicit coordinate system.
  • Figure 3: The overview of our pipeline.
  • Figure 4: Transformation to object-centric images. When an input image does not look at the target object, we apply a homography transformation so that it looks at the center of the target object, which leads to a new pose and a new intrinsic matrix.
  • Figure 5: Zero123 is able to rotate a given image by a given $\Delta$azimuth and $\Delta$elevation in the object canonical coordinate system. However, Zero123 assumes that the $Y+$ direction (up) of the image is aligned with the gravity direction. Moreover, for the same rotation angles ($\Delta$azimuth and $\Delta$elevation), the actual rotation depends on the elevation angle of the input image. This requires us to estimate the in-plane rotation and the canonical azimuth of the input image.
  • ...and 9 more figures
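
The object-centric transformation described in the Figure 4 caption, with homography $\mathbf{H}=\mathbf{K}_v\mathbf{R}_v\mathbf{K}^{-1}$, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: $\mathbf{R}_v$ is built here as the smallest rotation that brings the viewing ray through the object center onto the optical axis, and $\mathbf{K}_v$ simply reuses $\mathbf{K}$, whereas the paper's own construction of the new intrinsic matrix may differ. The function names and the example intrinsics are hypothetical.

```python
import numpy as np

def rotation_aligning(a, b):
    """Rotation R with R @ a = b for unit vectors a, b (requires a != -b)."""
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    # Rodrigues-style closed form for the rotation taking a onto b.
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

def object_centric_homography(K, center_px, K_v=None):
    """Return (H, R_v) with H = K_v @ R_v @ inv(K).

    center_px is the pixel of the object center in the input image.
    K_v defaults to K, which is an assumption of this sketch.
    """
    K_v = K if K_v is None else K_v
    K_inv = np.linalg.inv(K)
    ray = K_inv @ np.array([center_px[0], center_px[1], 1.0])
    ray = ray / np.linalg.norm(ray)
    R_v = rotation_aligning(ray, np.array([0.0, 0.0, 1.0]))
    return K_v @ R_v @ K_inv, R_v

# Example intrinsics (hypothetical): focal length 500 px, principal point (320, 240).
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
H, R_v = object_centric_homography(K, center_px=(450.0, 300.0))

# The object center should land on the principal point after warping.
p = H @ np.array([450.0, 300.0, 1.0])
p = p[:2] / p[2]
```

In practice the image itself would then be resampled with `H`, for example with OpenCV's `cv2.warpPerspective`, and the camera pose of the warped view is the original pose left-multiplied by `R_v`.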