KGpose: Keypoint-Graph Driven End-to-End Multi-Object 6D Pose Estimation via Point-Wise Pose Voting

Andrew Jeong

TL;DR

KGpose tackles multi-object 6D pose estimation from RGB-D data in an end-to-end framework. It introduces a keypoint-graph representation and a point-wise pose voting scheme that regresses poses directly from graph-embedded keypoints via graph convolutions, enabling unified handling of multiple instances without explicit localization. On YCB-Video, KGpose delivers competitive ADD and ADD-S AUC metrics, demonstrating effective cross-modal fusion, attentive feature integration, and disentangled pose losses. The approach offers an efficient, differentiable pipeline for robotic manipulation in complex scenes and points to future work extending to more objects, outdoor environments, and self-supervised learning.
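As a reference for the ADD and ADD-S metrics mentioned above, the following is a minimal sketch of their standard per-object definitions, not code from the paper. The function names, the nested-list rotation matrices, and the brute-force nearest-point search are illustrative assumptions; the AUC step (integrating accuracy over a distance threshold) is omitted.

```python
import math


def transform(points, R, t):
    """Apply a 3x3 rotation R (nested lists) and a translation t to each 3D point."""
    return [
        tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        for p in points
    ]


def add_metric(model_points, R_gt, t_gt, R_pred, t_pred):
    """ADD: mean distance between corresponding model points under the two poses."""
    gt = transform(model_points, R_gt, t_gt)
    pred = transform(model_points, R_pred, t_pred)
    return sum(math.dist(a, b) for a, b in zip(gt, pred)) / len(model_points)


def add_s_metric(model_points, R_gt, t_gt, R_pred, t_pred):
    """ADD-S: mean distance from each ground-truth point to its closest predicted point.

    Brute force, O(n^2); intended only to illustrate the definition.
    """
    gt = transform(model_points, R_gt, t_gt)
    pred = transform(model_points, R_pred, t_pred)
    return sum(min(math.dist(g, p) for p in pred) for g in gt) / len(model_points)
```

ADD penalizes any mismatch between corresponding points, while ADD-S uses closest-point distances, so a pose that differs from the ground truth only by an object symmetry scores well on ADD-S even when its ADD is large.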

Abstract

This letter presents KGpose, a novel end-to-end framework for 6D pose estimation of multiple objects. Our approach combines a keypoint-based method with learnable pose regression through a 'keypoint-graph', which is a graph representation of the keypoints. KGpose first estimates 3D keypoints for each object using an attentional multi-modal fusion of RGB and point cloud features. These keypoints are estimated from each point of the point cloud and converted into a graph representation. The network directly regresses 6D pose parameters for each point through a sequence of keypoint-graph embedding and local graph embedding stages, both built on graph convolutions, followed by rotation and translation heads. The final pose for each object is selected from the candidates of point-wise predictions. The method achieves competitive results on the benchmark dataset, demonstrating the effectiveness of our model. KGpose enables multi-object pose estimation without requiring an extra localization step, offering a unified and efficient solution for understanding geometric contexts in complex scenes for robotic applications.
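The final selection step can be sketched as follows. The Figure 2 caption describes choosing the candidate nearest to the mean of an object's candidates; everything else here is an assumption for illustration: each candidate is a flat list of pose parameters, nearness is plain Euclidean distance in that parameter space, and `object_ids` is a hypothetical per-point object label. The function names are likewise hypothetical.

```python
import math


def select_final_pose(candidates):
    """Return the candidate closest to the component-wise mean of all candidates."""
    if not candidates:
        raise ValueError("need at least one pose candidate")
    dim = len(candidates[0])
    mean = [sum(c[i] for c in candidates) / len(candidates) for i in range(dim)]
    # Assumed distance: Euclidean in parameter space (not specified by the source).
    return min(candidates, key=lambda c: math.dist(c, mean))


def select_per_object(candidates, object_ids):
    """Group point-wise candidates by (hypothetical) object label, then select per group."""
    grouped = {}
    for cand, oid in zip(candidates, object_ids):
        grouped.setdefault(oid, []).append(cand)
    return {oid: select_final_pose(group) for oid, group in grouped.items()}
```

Picking an actual candidate rather than the mean itself means the output is always a pose the network predicted. If rotations are parameterized as quaternions, a component-wise mean would also need sign alignment first, which this sketch does not handle.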

Paper Structure (19 sections, 10 equations, 5 figures, 1 table)

This paper contains 19 sections, 10 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Our approach. (a) Each point in the input point cloud votes for the 3D keypoints of each object in the scene. (b) Each set of keypoints is converted into a graph, called a 'keypoint-graph'. (c) The 6D pose of an object is estimated (or voted) from each point through several layers of graph convolution.
  • Figure 2: Overview of KGpose. Given an RGB-D image, 3D keypoints are estimated through RGB and point cloud branches along with a feature fusion process. Then, the estimated keypoints and the corresponding keypoints from the object model are converted into a graph representation and concatenated. These graphs are passed through a keypoint-graph embedding stage and several local graph embedding stages to embed the graph features into each node (or point). Finally, the embedded point features are fed into rotation and translation heads to regress 6D pose parameters for each point, which serve as 6D pose candidates. The final 6D pose for each object is determined by selecting the candidate nearest to the mean of that object's candidates.
  • Figure 3: Revised CBAM module for both feature fusion and skip connections.
  • Figure 4: Process of graph embedding. (a) Estimated keypoints are regarded as vertices, and edges are defined as vectors from the keypoints to their center. Edge features from each graph are passed through Edge Convolution to embed the keypoint information into each point that voted for them. (b) The embedded point features form local graphs over their k nearest neighbors (k-NN), and the edge features from these graphs are also fed into an Edge Convolution layer to embed graph features. (c) The point features are sampled and form local k-NN graphs to capture the features with a larger receptive field. (d) The outcomes from (b) and (c) are fused to update the point features.
  • Figure 5: Qualitative results on the YCB-Video dataset. Vertices of the object model are transformed by the estimated pose and projected onto the 2D image to visualize 6D pose estimation performance. Our model works well but still has difficulty with heavily occluded scenes (see the right figure).
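The graph construction described in the Figure 4 caption can be illustrated with a small sketch. It is not the paper's implementation: the function names are hypothetical, "center" is assumed to be the centroid of the keypoints, the k-NN search is brute force, and the learned layer of Edge Convolution is omitted so that only the edge-feature layout (each point's feature concatenated with the neighbor offset) and a channel-wise max aggregation remain.

```python
import math


def keypoint_graph_edges(keypoints):
    """Edge vectors from each keypoint to the keypoint center (assumed: the centroid)."""
    n = len(keypoints)
    center = tuple(sum(k[i] for k in keypoints) / n for i in range(3))
    return [tuple(center[i] - k[i] for i in range(3)) for k in keypoints]


def knn_indices(points, k):
    """For each point, the indices of its k nearest other points (brute force)."""
    result = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        result.append(others[:k])
    return result


def local_edge_features(features, neighbors):
    """For each edge (i, j), concatenate x_i with the offset (x_j - x_i)."""
    out = []
    for i, nbrs in enumerate(neighbors):
        xi = features[i]
        out.append([
            tuple(xi) + tuple(b - a for a, b in zip(xi, features[j]))
            for j in nbrs
        ])
    return out


def max_aggregate(edge_features):
    """Channel-wise max over each point's edges (the learned layer is omitted here)."""
    return [tuple(max(channel) for channel in zip(*edges)) for edges in edge_features]
```

In a trained network, a shared learned function would be applied to each edge feature before the max aggregation; the sketch keeps only the data flow.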