DiffusionNOCS: Managing Symmetry and Uncertainty in Sim2Real Multi-Modal Category-level Pose Estimation

Takuya Ikeda, Sergey Zakharov, Tianyi Ko, Muhammad Zubair Irshad, Robert Lee, Katherine Liu, Rares Ambrus, Koichi Nishiwaki

TL;DR

This work proposes a probabilistic, diffusion-based model that estimates dense canonical maps, which are used both to recover partial object shapes and to establish the correspondences needed for pose estimation. It also introduces components that improve performance by combining the strengths of diffusion models with multi-modal input representations.

Abstract

This paper addresses the challenging problem of category-level pose estimation. Current state-of-the-art methods for this task face challenges when dealing with symmetric objects and when attempting to generalize to new environments solely through synthetic data training. In this work, we address these challenges by proposing a probabilistic model that relies on diffusion to estimate dense canonical maps crucial for recovering partial object shapes as well as establishing correspondences essential for pose estimation. Furthermore, we introduce critical components to enhance performance by leveraging the strength of the diffusion models with multi-modal input representations. We demonstrate the effectiveness of our method by testing it on a range of real datasets. Despite being trained solely on our generated synthetic data, our approach achieves state-of-the-art performance and unprecedented generalization qualities, outperforming baselines, even those specifically trained on the target domain.

Paper Structure (24 sections, 3 equations, 9 figures, 3 tables)


Figures (9)

  • Figure 1: Overview: Our method estimates multiple possible poses from a single observation via diffusion models. The diffusion models can be conditioned on any image and naturally handle the ambiguity caused by symmetry: predictions are certain when the mug handle is visible in the input image and uncertain when it is not. The projected Normalized Object Coordinate Space (NOCS) map is denoised by the diffusion model and used as dense correspondences to estimate the poses.
  • Figure 2: Pipeline Overview: To prepare the input representations from RGB, depth, and a given 2D bounding box, masking, warping, and resizing are applied sequentially. For segmentation, SAM (Kirillov et al., 2023) with box prompts is used. To extract low-dimensional DINOv2 features of fixed shape, PCA and resizing are applied. The diffusion model is then conditioned on the available inputs and estimates NOCS maps from noise images. Lastly, the 6D pose and its confidence are estimated via point registration between the denoised NOCS maps and partial point clouds computed from the mask, depth, and given intrinsics (Yang et al., 2020, TEASER).
  • Figure 3: Semantic Features: DINOv2 features reduced to three dimensions via PCA, shown alongside the input RGB images to illustrate their consistency across different instances.
  • Figure 4: Estimated NOCS maps from Different Conditions: As a generative model, our method is capable of recovering plausible NOCS maps even when provided with only a noise image as input; for example, the top left example shows a NOCS map closely resembling the "laptop" object. Adding a category ID to the input limits the output distribution to objects of the specified category (top right). When RGB images or surface normals are provided, the silhouettes of the resulting NOCS maps are aligned with the inputs. Moreover, the output NOCS map faithfully preserves the geometric information provided by the surface normals, as can be seen from the bottom right example, where a hollow bottle is reconstructed. In contrast, a filled bottle is recovered when RGB information alone is provided (bottom left).
  • Figure 5: Synthetic Data for DiffusionNOCS: A visualization of camera poses and generated synthetic images used to train DiffusionNOCS.
  • ...and 4 more figures
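
The last step of the pipeline in Figure 2, recovering a pose from NOCS-to-point-cloud correspondences, can be illustrated with a minimal sketch. This is not the authors' implementation: the caption's citation points to a TEASER-style robust registration, whereas the code below swaps in the plain closed-form least-squares similarity alignment of Umeyama (1991), which assumes outlier-free correspondences and produces no confidence value. The function names `backproject` and `umeyama`, the synthetic data, and the choice of NOCS coordinates centered at the origin are all assumptions made for illustration.

```python
import numpy as np


def backproject(depth, mask, K):
    """Lift masked depth pixels to camera-frame 3D points (pinhole model, intrinsics K)."""
    v, u = np.nonzero(mask)                 # pixel rows (v) and columns (u) inside the mask
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)      # shape (N, 3)


def umeyama(src, dst):
    """Least-squares similarity transform such that dst ~= s * R @ src + t."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)        # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # avoid returning a reflection
    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


# Synthetic check: random stand-ins for NOCS coordinates, moved by a known pose.
rng = np.random.default_rng(0)
nocs = rng.uniform(-0.5, 0.5, size=(200, 3))
angle = 0.3
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
s_true, t_true = 0.2, np.array([0.1, -0.05, 0.8])
points = s_true * nocs @ R_true.T + t_true  # what back-projected depth would give

s, R, t = umeyama(nocs, points)
```

In a real pipeline, `points` would come from `backproject(depth, mask, K)` and `nocs` from the denoised NOCS map at the same masked pixels, so that row i of both arrays refers to the same pixel. With noisy or partially wrong NOCS predictions, a robust estimator would be needed in place of this least-squares fit.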