Particle-based 6D Object Pose Estimation from Point Clouds using Diffusion Models

Christian Möller, Niklas Funk, Jan Peters

TL;DR

This work introduces a diffusion-based approach for 6D object pose estimation from a single depth view using point clouds. It embeds scene and object observations in an SE(3)-equivariant latent space via Vector Neurons, generates multiple pose hypotheses, and applies one of two lightweight, training-free particle-selection strategies to pick a single pose. Partial rendering and the SE(3)-equivariant latent representation are shown to improve accuracy and inference speed, which helps the method handle occlusions and object symmetries on the Linemod dataset. The method achieves competitive accuracy with notable efficiency gains, particularly when rendering is performed only every few iterations, making it practical for real-time or near-real-time deployment in 3D vision tasks.

Abstract

Object pose estimation from a single view remains a challenging problem. In particular, partial observability, occlusions, and object symmetries eventually result in pose ambiguity. To account for this multimodality, this work proposes training a diffusion-based generative model for 6D object pose estimation. During inference, the trained generative model allows for sampling multiple particles, i.e., pose hypotheses. To distill this information into a single pose estimate, we propose two novel and effective pose selection strategies that do not require any additional training or computationally intensive operations. Moreover, while many existing methods for pose estimation primarily focus on the image domain and only incorporate depth information for final pose refinement, our model solely operates on point cloud data. The model thereby leverages recent advancements in point cloud processing and operates upon an SE(3)-equivariant latent space that forms the basis for the particle selection strategies and allows for improved inference times. Our thorough experimental results demonstrate the competitive performance of our approach on the Linemod dataset and showcase the effectiveness of our design choices. Code is available at https://github.com/zitronian/6DPoseDiffusion .

Paper Structure

This paper contains 24 sections, 1 equation, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: Architecture for learning 6D pose distributions. The object point cloud $\boldsymbol{x}_o$ is sampled from a 3D model of the object whose pose we want to determine in the scene. Given a specific pose $\boldsymbol{H} \in \mathrm{SE}(3)$ and the object's normal vectors $\boldsymbol{n}_o$, the object point cloud is partially rendered. The scene point cloud $\boldsymbol{x}_s$ and the partially rendered object point cloud $\boldsymbol{x}_{o}^{s'}$ are embedded through a shared encoder that leverages Vector Neurons. Finally, the scene latent $\boldsymbol{z}_s$ and the object latent $\boldsymbol{z}_o$ are flattened and, together with an encoded time step and the respective pose $\boldsymbol{H}$, fed through a Feature Encoder $F_{\theta}$ and a Decoder $D_{\theta}$ to predict a score $\boldsymbol{s} \in \mathbb{R}^{6}$.
  • Figure 2: Objects of the Linemod dataset (Hinterstoisser et al., 2011), excluding the two objects with incomplete 3D models. * denotes symmetric objects.
  • Figure 3: Accuracy curves with AUC values for all selection methods, for a model trained and tested on all objects. Single-particle sampling, i.e., eliminating the need for any selection strategy, yields the lower bound, and selection using ground-truth information yields the upper bound.
  • Figure 4: Heatmap showing the correlation between the true rank and the predicted rank of the 20 sampled particles. On the left, the predicted rank is based on selection by score; on the right, it is based on selection by latent.
  • Figure 5: Comparison of using the partially observable object model and using the fully observable object model. (a) is based on selection by score, whereas (b) is based on selection by latent. Both plots show the accuracy for each object.
  • ...and 7 more figures
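
The Figure 1 caption states that the object point cloud is partially rendered given a pose $\boldsymbol{H}$ and the normal vectors $\boldsymbol{n}_o$, but this page does not spell out how. The following is a minimal sketch of one common way to do this, assuming that "partial rendering" can be approximated by discarding points whose normals face away from the camera (back-face culling). The function name `partially_render`, the `camera_origin` parameter, and the culling rule itself are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def partially_render(points, normals, H, camera_origin=None):
    """Hypothetical helper: keep points whose normals face the camera.

    points, normals: (N, 3) arrays in the object frame.
    H: (4, 4) homogeneous transform from the object frame to the camera frame.
    camera_origin: (3,) camera position in the camera frame (assumed at the origin).
    Returns the visible points expressed in the camera frame.
    """
    if camera_origin is None:
        camera_origin = np.zeros(3)
    R, t = H[:3, :3], H[:3, 3]
    pts_cam = points @ R.T + t            # rotate and translate the points
    nrm_cam = normals @ R.T               # normals rotate but do not translate
    view_dirs = camera_origin - pts_cam   # vectors from each point to the camera
    # A point counts as visible if its normal has a positive component
    # along the direction towards the camera (assumption: no self-occlusion test).
    facing = np.einsum("ij,ij->i", nrm_cam, view_dirs) > 0.0
    return pts_cam[facing]


# Two points on opposite faces of a thin slab, pushed 2 units along the camera z-axis.
points = np.array([[0.0, 0.0, -0.5], [0.0, 0.0, 0.5]])
normals = np.array([[0.0, 0.0, -1.0], [0.0, 0.0, 1.0]])
H = np.eye(4)
H[2, 3] = 2.0
visible_points = partially_render(points, normals, H)
```

In this toy example only the point on the near face survives. A normal-based test like this ignores self-occlusion by other parts of the object, so it is at best a rough stand-in for a real depth-buffer rendering.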