KGpose: Keypoint-Graph Driven End-to-End Multi-Object 6D Pose Estimation via Point-Wise Pose Voting
Andrew Jeong
TL;DR
KGpose tackles multi-object 6D pose estimation from RGB-D data in an end-to-end framework. It introduces a keypoint-graph representation and a point-wise pose voting scheme that regresses poses directly from graph-embedded keypoints via graph convolutions, enabling unified handling of multiple instances without explicit localization. On YCB-Video, KGpose delivers competitive ADD and ADD-S AUC metrics, demonstrating effective cross-modal fusion, attentive feature integration, and disentangled pose losses. The approach offers an efficient, differentiable pipeline for robotic manipulation in complex scenes and points to future work extending to more objects, outdoor environments, and self-supervised learning.
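For readers unfamiliar with the metrics named above, the following is a minimal NumPy sketch of the standard ADD and ADD-S distances, not code from the paper. The function names and the brute-force nearest-neighbour search are illustrative assumptions; the AUC variants reported on YCB-Video additionally integrate accuracy over a range of distance thresholds, which is omitted here.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform (R, t) to an (N, 3) array of model points."""
    return points @ R.T + t

def add_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding model points under both poses."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    return float(np.linalg.norm(gt - pred, axis=1).mean())

def add_s_metric(points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each ground-truth-posed point to the
    closest predicted-posed point (used for symmetric objects)."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    # Brute-force (N, N) pairwise distances; adequate for small point sets.
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())
```

Because ADD-S matches each point to its nearest neighbour rather than to its fixed correspondence, a pose that differs from the ground truth only by a symmetry of the object scores near zero under ADD-S while still scoring poorly under ADD.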
Abstract
This letter presents KGpose, a novel end-to-end framework for 6D pose estimation of multiple objects. Our approach combines a keypoint-based method with learnable pose regression through a 'keypoint-graph', a graph representation of the keypoints. KGpose first estimates 3D keypoints for each object using an attentional multi-modal fusion of RGB and point cloud features. The keypoints are estimated from each point of the point cloud and converted into a graph representation. The network then directly regresses 6D pose parameters for each point through a sequence of keypoint-graph embedding and local graph embedding, both built from graph convolutions, followed by rotation and translation heads. The final pose for each object is selected from the candidate point-wise predictions. The method achieves competitive results on the benchmark dataset, demonstrating the effectiveness of our model. KGpose enables multi-object pose estimation without requiring an extra localization step, offering a unified and efficient solution for understanding geometric contexts in complex scenes for robotic applications.
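The final selection step described above can be illustrated with a small sketch. The abstract does not state the selection criterion, so this example assumes that each point carries a confidence score and an object label and that the highest-confidence candidate wins per object. The function name `select_poses`, the quaternion layout, and the convention that label 0 means background are all hypothetical and not taken from the paper.

```python
import numpy as np

def select_poses(rotations, translations, confidences, labels):
    """Pick one pose per object from per-point pose candidates.

    rotations:    (N, 4) per-point rotation candidates (quaternions, assumed)
    translations: (N, 3) per-point translation candidates
    confidences:  (N,)   per-point confidence scores (assumed to exist)
    labels:       (N,)   per-point integer object ids, 0 = background (assumed)
    """
    poses = {}
    for obj_id in np.unique(labels):
        if obj_id == 0:
            continue  # skip background points
        idx = np.flatnonzero(labels == obj_id)
        # Keep the candidate with the highest confidence for this object.
        best = idx[np.argmax(confidences[idx])]
        poses[int(obj_id)] = (rotations[best], translations[best])
    return poses
```

Under these assumptions, every object present in the label map receives exactly one pose in a single pass, with no separate detection or cropping stage, which mirrors the multi-object behaviour the abstract describes.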
