Zero123-6D: Zero-shot Novel View Synthesis for RGB Category-level 6D Pose Estimation

Francesco Di Felice, Alberto Remus, Stefano Gasperini, Benjamin Busam, Lionel Ott, Federico Tombari, Roland Siegwart, Carlo Alberto Avizzano

TL;DR

This work presents Zero123-6D, the first work to demonstrate the utility of diffusion-model-based novel-view synthesizers in enhancing category-level RGB 6D pose estimation by integrating them with feature extraction techniques.

Abstract

Estimating the pose of objects through vision is essential for robotic platforms to interact with their environment. Yet it presents many challenges, often related to the limited flexibility and generalizability of state-of-the-art solutions. Diffusion models are a cutting-edge neural architecture that is transforming 2D and 3D computer vision and shows remarkable performance in zero-shot novel-view synthesis. This capability is particularly promising for reconstructing 3D objects. However, its use for localizing objects in unstructured environments remains largely unexplored. To this end, this work presents Zero123-6D, the first work to demonstrate the utility of diffusion-model-based novel-view synthesizers in enhancing category-level RGB 6D pose estimation by integrating them with feature extraction techniques. Novel-view synthesis provides a coarse pose, which is then refined through an online optimization method introduced in this work to handle intra-category geometric differences. In this way, the method reduces data requirements, removes the need for depth information in zero-shot category-level 6D pose estimation, and improves performance, as demonstrated quantitatively through experiments on the CO3D dataset.

Paper Structure (16 sections, 8 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Graphical overview of Zero123-6D. A set of $N$ reference views of instances belonging to a category (top left) is expanded with a novel-view synthesizer (middle). The view that best matches the query input (bottom left) semantically is selected, while a 3D CAD model is reconstructed from all generated images and their poses. Finally, 2D-3D correspondences are established to refine the best view's estimated pose (bottom right).
  • Figure 2: Given a set of $N$ RGB reference views and an RGB query image belonging to the same category (a), the goal is to find the 6D pose of the query object. The reference views are fed to a novel-view-synthesizer diffusion model to generate novel RGB views at coarse poses (b). The query and generated views are then compared semantically using DINO features, and the generated view that best matches the query is selected, providing a set of 2D correspondences between the two views and a coarse 6D pose (c). At the same time, all the posed generated views of the object are used to reconstruct a 3D mesh with a neural surface reconstructor (d), which yields the 3D points corresponding to the matched 2D reference points. Finally, an online optimization between the 2D query points and the corresponding 3D reference points produces the final refined pose (e).
  • Figure 3: Qualitative results of Zero123-6D on the CO3D dataset (Reizenstein et al., 2021) and the corresponding feature maps, visualized with PCA reduced to three channels.
  • Figure 4: Qualitative results of Zero123-6D on the Objectron dataset (Ahmadyan et al., 2021). From left to right: the query RGB image, a reference example, the closest reference view with matched feature points, and the lifted RGB point cloud of the reference in the estimated 6D pose of the query (for visualization only). The reference objects are from CO3D (Reizenstein et al., 2021).
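
The view-selection step described in Figure 2 (c) can be illustrated with a minimal sketch. It is not the authors' implementation: the captions state only that DINO features are compared semantically, so the mutual-nearest-neighbour matching, the mean-cosine-similarity score, and the names `cosine`, `mutual_matches`, and `select_best_view` are all assumptions made for illustration. Descriptors are plain Python lists standing in for per-keypoint DINO features, and each generated view carries a hypothetical pose label in place of a real coarse pose.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def mutual_matches(query_desc, view_desc):
    """Return (query_idx, view_idx, similarity) for mutual nearest-neighbour pairs.

    Assumption: mutual nearest neighbours are used as the matching rule;
    the source does not specify the criterion.
    """
    if not query_desc or not view_desc:
        return []
    sims = [[cosine(q, v) for v in view_desc] for q in query_desc]
    # Best view descriptor for each query descriptor.
    best_for_q = [max(range(len(view_desc)), key=lambda j: row[j]) for row in sims]
    # Best query descriptor for each view descriptor.
    best_for_v = [
        max(range(len(query_desc)), key=lambda i: sims[i][j])
        for j in range(len(view_desc))
    ]
    return [
        (i, j, sims[i][j]) for i, j in enumerate(best_for_q) if best_for_v[j] == i
    ]


def select_best_view(query_desc, views):
    """Pick the generated view whose mutual matches have the highest mean similarity.

    `views` is a list of (pose_label, descriptors) pairs; the label is a
    hypothetical stand-in for the coarse pose attached to a generated view.
    Returns (score, pose_label, matches), or None if no view has any match.
    """
    best = None
    for pose, desc in views:
        matches = mutual_matches(query_desc, desc)
        if not matches:
            continue
        score = sum(s for _, _, s in matches) / len(matches)
        if best is None or score > best[0]:
            best = (score, pose, matches)
    return best


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    generated = [
        ("azimuth_0", [[0.0, 1.0], [1.0, 0.0]]),
        ("azimuth_90", [[1.0, 1.0], [1.0, -1.0]]),
    ]
    score, pose, matches = select_best_view(query, generated)
    print(pose, round(score, 3), [(i, j) for i, j, _ in matches])
```

In this toy setup, the selected view's pose label plays the role of the coarse pose, and the returned index pairs play the role of the 2D correspondences that the refinement stage would lift to 3D through the reconstructed mesh.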