Generalizable Single-view Object Pose Estimation by Two-side Generating and Matching

Yujing Sun, Caiyi Sun, Yuan Liu, Yuexin Ma, Siu Ming Yiu

TL;DR

A novel generalizable object pose estimation method to determine the object pose using only one RGB image, which operates with a single reference image of the object, and eliminates the need for 3D object models or multiple views of the object.

Abstract

In this paper, we present a novel generalizable object pose estimation method to determine the object pose using only one RGB image. Unlike traditional approaches that rely on instance-level object pose estimation and necessitate extensive training data, our method offers generalization to unseen objects without extensive training, operates with a single reference image of the object, and eliminates the need for 3D object models or multiple views of the object. These characteristics are achieved by utilizing a diffusion model to generate novel-view images and conducting a two-sided matching on these generated images. Quantitative experiments demonstrate the superiority of our method over existing pose estimation techniques across both synthetic and real-world datasets. Remarkably, our approach maintains strong performance even in scenarios with significant viewpoint changes, highlighting its robustness and versatility in challenging conditions. The code will be released at https://github.com/scy639/Gen2SM.

Paper Structure

This paper contains 31 sections, 6 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Our method excels at accurately estimating object poses with a single reference image. Importantly, it maintains its accuracy even when faced with query images exhibiting substantial viewpoint changes from the reference image. The object poses estimated by our method offer significant value for a variety of practical applications such as 3D reconstruction.
  • Figure 2: (a) Utilizing Zero123 \cite{liu2023zero123} to directly generate an image from the viewpoint of the query image $I_q$ based on the reference image $I_r$. Due to the significant viewpoint change, the generated image exhibits low quality, hindering accurate matching with the query image. (b) Instead, we leverage Zero123 \cite{liu2023zero123} to generate images from intermediate viewpoints between the query image $I_q$ and the reference image $I_r$. These generated images at intermediate viewpoints show better alignment, facilitating the estimation of an accurate relative pose between $I_r$ and $I_q$.
  • Figure 3: The overview of our method. Given a reference image $I_r$ containing an object, our method estimates the object pose of a query image $I_q$ containing the same object. We first use a pre-trained diffusion model to generate novel-view images. Then we estimate the elevation $\theta_q$ and azimuth $\phi_q$ of $I_q$ by minimizing the two-side matching loss, which is explained in Sec. \ref{sec:two_side_matching}.
  • Figure 4: We generate images $\mathbb{G}(z, I_r, \Delta \theta_{ri}, \Delta \phi_{ri})$ of $I_r$ and $\mathbb{G}(z, I_q, \Delta \theta_{qi}, \Delta \phi_{qi})$ of $I_q$ on $N$ intermediate viewpoints $\{(\phi_i,\theta_i) \mid i = 1,\dots,N\}$, which are sampled evenly from the upper hemisphere of the object. When the assumed $\phi_q$ and $\theta_q$ are correct, $\mathbb{G}(z, I_q, \Delta \theta_{qi}, \Delta \phi_{qi})$ and $\mathbb{G}(z, I_r, \Delta \theta_{ri}, \Delta \phi_{ri})$ are well-matched for each $i$.
  • Figure 5: As shown in Problem \ref{eq:set2set_argmin_SDS}, we approximate the original two-side matching problem by minimizing the loss function proposed by Poole et al. \cite{poole2022dreamfusion}. For a generated image set $\mathcal{G}_r=\{I_{r\to i} \mid i=1,2,\dots,N\} = \{\mathbb{G}(z, I_r, \Delta \theta_{ri}, \Delta \phi_{ri}) \mid i=1,2,\dots,N\}$, we add noise to $I_{r\to i}$ to obtain $I_{r\to i,t}$ as in Eq. (\ref{eq:forward}). Then, we employ the denoiser $\epsilon_\Theta$ of Zero123 to get the predicted noise $\epsilon_\Theta(I_{r\to i,t} \mid I_q,\Delta \theta_{qi}, \Delta \phi_{qi})$ and use the $L_2$ loss as the distance between the generated images.
  • ...and 3 more figures
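
The steps described in the captions of Figures 4 and 5 can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. The Fibonacci-spiral sampler is only one way to obtain roughly even viewpoints on the upper hemisphere; the relative pose is assumed to be a simple difference of elevation and azimuth; the noising step is assumed to be the standard DDPM forward process; and the $L_2$ loss is assumed to compare the predicted noise with the injected noise, as in score distillation. All function names, `alpha_bar_t`, and the `noise_predictor` callable (a stand-in for the Zero123 denoiser $\epsilon_\Theta$ already conditioned on $I_q$) are hypothetical.

```python
import numpy as np


def sample_upper_hemisphere(n):
    """Roughly even viewpoints on the upper hemisphere (Fibonacci spiral).

    Assumption: the captions only say "sampled evenly"; this sampler is one
    possible choice. Returns an (n, 2) array of (elevation, azimuth) in radians.
    """
    i = np.arange(n)
    golden = (1 + 5 ** 0.5) / 2
    z = (i + 0.5) / n                      # uniform height gives uniform area
    elevation = np.arcsin(z)               # in (0, pi/2)
    azimuth = (2 * np.pi * i / golden) % (2 * np.pi)
    return np.stack([elevation, azimuth], axis=1)


def relative_pose(src, dst):
    """(delta elevation, delta azimuth) from pose `src` to poses `dst`.

    Assumption: plain differences, with azimuth wrapped to [-pi, pi).
    """
    d_theta = dst[..., 0] - src[0]
    d_phi = (dst[..., 1] - src[1] + np.pi) % (2 * np.pi) - np.pi
    return np.stack([d_theta, d_phi], axis=-1)


def add_noise(image, alpha_bar_t, rng):
    """Standard DDPM forward step: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(image.shape)
    noisy = np.sqrt(alpha_bar_t) * image + np.sqrt(1 - alpha_bar_t) * eps
    return noisy, eps


def matching_loss(generated_r, noise_predictor, deltas_q, alpha_bar_t, rng):
    """Mean L2 between predicted and injected noise over the N viewpoints.

    generated_r:     images generated from I_r at the intermediate viewpoints
    noise_predictor: hypothetical callable (noisy_image, delta) -> noise,
                     standing in for the denoiser conditioned on I_q
    deltas_q:        relative poses from the assumed query pose to each viewpoint
    """
    total = 0.0
    for image, delta in zip(generated_r, deltas_q):
        noisy, eps = add_noise(image, alpha_bar_t, rng)
        eps_hat = noise_predictor(noisy, delta)
        total += np.mean((eps_hat - eps) ** 2)
    return total / len(generated_r)
```

Under these assumptions, a candidate $(\theta_q, \phi_q)$ would be scored by calling `relative_pose` with that candidate as `src` and the sampled viewpoints as `dst`, then passing the result to `matching_loss`; the candidate with the lowest loss is kept. Only one direction is shown here; the other side would swap the roles of $I_r$ and $I_q$.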