6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation
Li Xu, Haoxuan Qu, Yujun Cai, Jun Liu
TL;DR
This work tackles RGB-based 6D object pose estimation under occlusion and clutter by reframing 2D keypoint detection as a reverse diffusion problem. The authors introduce 6D-Diff, which uses a Mixture-of-Cauchy forward process to model heatmap-derived keypoint distributions and conditions the reverse diffusion on object appearance features, enabling robust 2D-3D correspondences for PnP pose estimation. Through a two-stage training regime and a transformer-based diffusion model, the approach achieves state-of-the-art results on LM-O and YCB-V, with ablations validating the benefits of denoising, MoC priors, and appearance conditioning. Overall, the diffusion-based distribution-transformation paradigm demonstrated here enhances pose estimation robustness in cluttered scenes and offers a principled framework for incorporating priors from intermediate representations into diffusion-based estimation.
Abstract
Estimating the 6D object pose from a single RGB image often involves noise and indeterminacy due to challenges such as occlusions and cluttered backgrounds. Meanwhile, diffusion models have shown appealing performance in generating high-quality images from random noise with high indeterminacy through step-by-step denoising. Inspired by this denoising capability, we propose a novel diffusion-based framework (6D-Diff) to handle the noise and indeterminacy in object pose estimation. In our framework, to establish accurate 2D-3D correspondences, we formulate 2D keypoint detection as a reverse diffusion (denoising) process. To facilitate this denoising process, we design a Mixture-of-Cauchy-based forward diffusion process and condition the reverse process on the object features. Extensive experiments on the LM-O and YCB-V datasets demonstrate the effectiveness of our framework.
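To make the Mixture-of-Cauchy idea more concrete, the following Python sketch shows one way noisy 2D keypoint candidates could be drawn from such a mixture, which is the kind of distribution a reverse diffusion process would then denoise. It is an illustration under stated assumptions, not the authors' implementation: the function names, the per-axis independent Cauchy components, and all weights, centers, and scales are hypothetical, whereas the paper obtains its distribution from keypoint heatmaps.

```python
import math
import random


def sample_cauchy(loc, scale, rng):
    """Draw one sample from a 1D Cauchy distribution by inverse-CDF sampling."""
    u = rng.random()  # uniform in [0, 1)
    return loc + scale * math.tan(math.pi * (u - 0.5))


def moc_pdf(x, weights, locs, scales):
    """Density of a 1D Mixture-of-Cauchy evaluated at x."""
    return sum(
        w / (math.pi * s * (1.0 + ((x - m) / s) ** 2))
        for w, m, s in zip(weights, locs, scales)
    )


def sample_moc_keypoints(weights, centers, scales, n, rng):
    """Draw n noisy 2D points from a mixture whose components are
    independent per-axis Cauchy distributions (a simplifying assumption)."""
    points = []
    for _ in range(n):
        # Pick a mixture component according to its weight.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        cx, cy = centers[k]
        points.append(
            (sample_cauchy(cx, scales[k], rng), sample_cauchy(cy, scales[k], rng))
        )
    return points


# Hypothetical example: two heatmap modes for one keypoint, in pixel coordinates.
rng = random.Random(0)
noisy_keypoints = sample_moc_keypoints(
    weights=[0.7, 0.3],
    centers=[(120.0, 80.0), (140.0, 95.0)],
    scales=[2.0, 4.0],
    n=5,
    rng=rng,
)
```

The heavy tails of the Cauchy components mean that occasional samples land far from every mode, which loosely mirrors the outlier-prone keypoint estimates that arise under occlusion and clutter.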
