6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation

Li Xu, Haoxuan Qu, Yujun Cai, Jun Liu

TL;DR

This work tackles RGB-based 6D object pose estimation under occlusion and clutter by reframing 2D keypoint detection as a reverse diffusion problem. The authors introduce 6D-Diff, which uses a Mixture-of-Cauchy forward process to model heatmap-derived keypoint distributions and conditions the reverse diffusion on object appearance features, enabling robust 2D-3D correspondences for PnP pose estimation. Through a two-stage training regime and a transformer-based diffusion model, the approach achieves state-of-the-art results on LM-O and YCB-V, with ablations validating the benefits of denoising, MoC priors, and appearance conditioning. Overall, the diffusion-based distribution-transformation paradigm demonstrated here enhances pose estimation robustness in cluttered scenes and offers a principled framework for incorporating priors from intermediate representations into diffusion-based estimation.

Abstract

Estimating the 6D object pose from a single RGB image often involves noise and indeterminacy due to challenges such as occlusions and cluttered backgrounds. Meanwhile, diffusion models have shown appealing performance in generating high-quality images from random noise with high indeterminacy through step-by-step denoising. Inspired by their denoising capability, we propose a novel diffusion-based framework (6D-Diff) to handle the noise and indeterminacy in object pose estimation for better performance. In our framework, to establish accurate 2D-3D correspondences, we formulate 2D keypoint detection as a reverse diffusion (denoising) process. To facilitate this denoising process, we design a Mixture-of-Cauchy-based forward diffusion process and condition the reverse process on the object features. Extensive experiments on the LM-O and YCB-V datasets demonstrate the effectiveness of our framework.

Paper Structure (13 sections, 8 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Overview of our proposed 6D-Diff framework. As shown, given the 3D keypoints from the object 3D CAD model, we aim to detect the corresponding 2D keypoints in the image to obtain the 6D object pose. Note that when detecting keypoints, there are often challenges such as occlusions (including self-occlusions) and cluttered backgrounds that can introduce noise and indeterminacy into the process, impacting the accuracy of pose prediction.
  • Figure 2: Two examples of keypoint heatmaps, which serve as the intermediate representation [chen2020end, peng2019pvnet, castro2023crt] in our framework. The red dots indicate the ground-truth locations of the keypoints. In example (a), the target object is the pink cat, which is heavily occluded in the image and appears in a different pose from the 3D model. Because of occlusions and cluttered backgrounds, the keypoint heatmaps are noisy, reflecting the noise and indeterminacy in the keypoint detection process.
  • Figure 3: Illustration of our framework. During testing, given an input image, we first crop the Region of Interest (ROI) from the image with an object detector. We then feed the cropped ROI to the keypoint distribution initializer to obtain heatmaps, which provide useful distribution priors about the keypoints and are used to initialize $D_K$. Meanwhile, we obtain the object appearance features $f_{\text{app}}$. Next, we pass $f_{\text{app}}$ into the encoder, whose output serves as conditional information to aid the reverse process in the decoder. We sample $M$ sets of 2D keypoint coordinates from $D_K$ and feed these $M$ sets into the decoder, together with the step embedding $f^k_D$, to perform the reverse process iteratively. At the final reverse step (the $K$-th step), we average $\{d^i_0\}^{M}_{i=1}$ to obtain the final keypoint coordinate prediction $d_0$, and use $d_0$ with the pre-selected 3D keypoints to compute the 6D pose via a PnP solver.
  • Figure 4: Visualization of the denoising process for one sample with our framework. In this example, the target object is the yellow duck; for clarity, only three keypoints are shown. The red dots indicate the ground-truth locations of these three keypoints. The noisy heatmap before denoising shows that factors such as occlusions and clutter in the scene can introduce noise and indeterminacy when detecting keypoints. As shown, our diffusion model effectively and smoothly reduces the noise and indeterminacy in the initial distribution step by step, finally recovering a high-quality and determinate distribution of keypoint coordinates. (Best viewed in color.)
  • Figure 5: Qualitative results. Green bounding boxes represent the ground-truth poses and blue bounding boxes represent the predicted poses of our method. As shown, even facing severe occlusions, clutter in the scene or varying environment, our framework can still accurately recover the object poses, showing the effectiveness of our method for handling the noise and indeterminacy caused by various factors in object pose estimation.
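
The test-time procedure described in the Figure 3 caption (sample $M$ coordinate sets from $D_K$, run $K$ reverse steps, average the results into $d_0$) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `sample_keypoints`, `toy_denoiser`, `reverse_process`, and the per-keypoint `anchors` are hypothetical names introduced here, and the toy denoiser is a hand-written placeholder for the trained, appearance-conditioned decoder. Each heatmap is simply treated as an unnormalized categorical distribution over pixels, which is an assumption about how $D_K$ is sampled.

```python
import random


def sample_keypoints(heatmaps, rng):
    """Draw one (x, y) per keypoint, treating each heatmap as an
    unnormalized categorical distribution over pixel locations."""
    coords = []
    for hm in heatmaps:
        width = len(hm[0])
        weights = [v for row in hm for v in row]
        idx = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        coords.append((float(idx % width), float(idx // width)))
    return coords


def toy_denoiser(coords, cond, k):
    """Hypothetical stand-in for the learned decoder. It moves every point
    a fraction 1/k of the way toward a per-keypoint anchor in `cond`, so the
    last step (k = 1) lands on the anchor. A real model would be a trained
    network conditioned on encoded appearance features."""
    return [
        (x + (ax - x) / k, y + (ay - y) / k)
        for (x, y), (ax, ay) in zip(coords, cond)
    ]


def reverse_process(heatmaps, cond, num_steps, num_samples, seed=0):
    rng = random.Random(seed)
    # Sample M sets of keypoint coordinates from the initial distribution D_K.
    samples = [sample_keypoints(heatmaps, rng) for _ in range(num_samples)]
    # Run K reverse steps, k = K, K-1, ..., 1, on every sample.
    for k in range(num_steps, 0, -1):
        samples = [toy_denoiser(s, cond, k) for s in samples]
    # Average the M denoised sets to obtain the final prediction d_0.
    return [
        (
            sum(s[j][0] for s in samples) / num_samples,
            sum(s[j][1] for s in samples) / num_samples,
        )
        for j in range(len(heatmaps))
    ]


# Two tiny 3x3 heatmaps (made-up values) and made-up anchors.
heatmaps = [
    [[0.0, 0.1, 0.0], [0.2, 1.0, 0.3], [0.0, 0.1, 0.0]],
    [[0.5, 0.1, 0.0], [0.1, 0.0, 0.0], [0.0, 0.0, 0.2]],
]
anchors = [(1.0, 1.0), (0.0, 0.0)]
d0 = reverse_process(heatmaps, anchors, num_steps=5, num_samples=8)
print(d0)
```

In a full pipeline, `d0` and the pre-selected 3D keypoints would then be passed to a PnP solver (OpenCV's `cv2.solvePnP` is one option) to recover the 6D pose; that step is omitted here to keep the sketch dependency-free.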