
KeyMatchNet: Zero-Shot Pose Estimation in 3D Point Clouds by Generalized Keypoint Matching

Frederik Hagelskjær, Rasmus Laurvig Haugaard

TL;DR

The paper tackles zero-shot pose estimation in 3D point clouds using depth-only data, targeting industrial scenarios where color information is seldom available. It introduces KeyMatchNet, a dual-branch network that computes object and scene features in parallel and matches object keypoints to scene points, followed by Kabsch-RANSAC for pose estimation. The authors show that precomputing object features and running RANSAC on the GPU yield fast runtimes while maintaining reasonable accuracy. They validate the approach on synthetic, out-of-class, and real data, training on a dataset of 1,500 CAD models in homogeneous-bin scenarios. Results indicate strong generalization to unseen objects and performance competitive with RGB-based methods even though no color information is used. This work may enable practical deployment for industrial bin-picking and similar tasks without collecting data or training for each new object.
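
The Kabsch-RANSAC step mentioned above can be sketched in NumPy. This is a minimal CPU illustration under stated assumptions, not the authors' GPU implementation: the function names `kabsch` and `kabsch_ransac`, the iteration count, the inlier distance, and the fixed seed are all hypothetical choices made for this example. It assumes that `obj_pts[i]` has been putatively matched to `scene_pts[i]`.

```python
import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that R @ src[i] + t ~= dst[i]."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)           # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def kabsch_ransac(obj_pts, scene_pts, iters=200, inlier_dist=0.01, seed=0):
    """Robust pose from putative matches; returns (R, t, inlier_mask)."""
    rng = np.random.default_rng(seed)
    best_R, best_t = np.eye(3), np.zeros(3)
    best_mask = np.zeros(len(obj_pts), dtype=bool)
    for _ in range(iters):
        # Three matches are the minimal sample for a rigid transform.
        idx = rng.choice(len(obj_pts), size=3, replace=False)
        R, t = kabsch(obj_pts[idx], scene_pts[idx])
        residual = np.linalg.norm(obj_pts @ R.T + t - scene_pts, axis=1)
        mask = residual < inlier_dist
        if mask.sum() > best_mask.sum():
            best_R, best_t, best_mask = R, t, mask
    if best_mask.sum() >= 3:
        # Refit on all inliers of the best hypothesis.
        best_R, best_t = kabsch(obj_pts[best_mask], scene_pts[best_mask])
    return best_R, best_t, best_mask
```

Each iteration fits a pose to three random matches and counts how many of the remaining matches agree with it, so a moderate share of wrong keypoint predictions can be tolerated as long as some minimal samples contain only correct matches.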

Abstract

In this paper, we present KeyMatchNet, a novel network for zero-shot pose estimation in 3D point clouds. Our method uses only depth information, making it more applicable for many industrial use cases, as color information is seldom available. The network is composed of two parallel components for computing object and scene features. The features are then combined to create matches used for pose estimation. The parallel structure allows for pre-processing of the individual parts, which decreases the run-time. Using a zero-shot network allows for a very short set-up time, as it is not necessary to train models for new objects. However, as the network is not trained for the specific object, zero-shot pose estimation methods generally have lower accuracy compared with conventional methods. To address this, we reduce the complexity of the task by including the scenario information during training. This is typically not feasible as collecting real data for new tasks drastically increases the cost. However, for zero-shot pose estimation, training for new objects is not necessary and the expensive data collection can thus be performed only once. Our method is trained on 1,500 objects and is only tested on unseen objects. We demonstrate that the trained network can accurately estimate poses for novel objects, and that it also generalizes to objects outside of the trained class. Test results are also shown on real data. We believe that the presented method is valuable for many real-world scenarios. Project page available at keymatchnet.github.io
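
The two ideas in the abstract, pre-processing the object branch and combining the outputs into matches, can be illustrated with a short sketch. Everything here is an assumption made for illustration rather than the authors' code: `ObjectFeatureCache`, `build_matches`, the `encode_object` callable standing in for the object branch, and the 0.5 segmentation threshold are hypothetical. The sketch assumes the network yields, for each scene point, a probability of lying on the object and a score for each object keypoint.

```python
import numpy as np


class ObjectFeatureCache:
    """Compute object-branch features once per object and reuse them for every scene."""

    def __init__(self, encode_object):
        self._encode = encode_object   # hypothetical stand-in for the object branch
        self._store = {}
        self.calls = 0                 # number of times the encoder actually ran

    def get(self, name, object_cloud):
        if name not in self._store:
            self._store[name] = self._encode(object_cloud)
            self.calls += 1
        return self._store[name]


def build_matches(scene_xyz, seg_prob, kp_scores, keypoint_xyz, seg_threshold=0.5):
    """Turn per-point predictions into 3D-3D matches.

    scene_xyz:    (N, 3) scene point positions
    seg_prob:     (N,)   predicted probability that a scene point lies on the object
    kp_scores:    (N, K) predicted score of each object keypoint per scene point
    keypoint_xyz: (K, 3) keypoint positions in the object frame
    Returns (object-side points, scene-side points), row i matched to row i.
    """
    on_object = seg_prob > seg_threshold       # keep only points predicted as object
    best_kp = kp_scores.argmax(axis=1)         # most likely keypoint per scene point
    return keypoint_xyz[best_kp[on_object]], scene_xyz[on_object]
```

Because the object features depend only on the object model, a cache like this lets the object branch run once at set-up time, leaving only the scene branch and the match construction to run per scene.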

Paper Structure (17 sections, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Illustration of the pose estimation method on an unseen object. The input to the network is a scene point cloud and an object point cloud. The object keypoints are visualized by the different color segments. The network outputs both an instance segmentation and keypoint predictions, which are combined to provide predictions only for the object. Finally, these predictions are used in RANSAC (Fischler and Bolles, 1981) for pose estimation. The striped keypoint prediction pattern is the result of the object's rotational symmetry.
  • Figure 2: The network structure of the developed method. Bold text indicates the matrix size, where N is the number of points in the scene, M is the number of points in the object, K is the number of keypoints, and F is the feature size at each point. The input point clouds consist of the combined xyz and normal vector information and thus have size 6 per point. Object and scene features are computed independently, which allows object features to be precomputed. The Segmentation and Keypoint Prediction outputs are shown along with their combination into matches.
  • Figure 3: Examples of the training data. Colors are only for visualization.
  • Figure 4: Top: The seven electronic components in the test dataset. Bottom: The seven components from the WRS dataset used for out-of-class tests.
  • Figure 5: Visualization of network output and resulting pose estimation using the RANSAC from Open3D (Zhou et al., 2018). The prediction accuracy is shown as follows: "white" represents correctly predicted background, "red" false negatives, and "yellow" false positives. "Blue" is correctly predicted segmentation of the object but wrong keypoints, and "green" is both correct segmentation and correct keypoint prediction. The segmentation accuracy is very high, and although the keypoint accuracy is lower, the resulting pose estimates are correct.
  • ...and 1 more figures
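
The color coding described in the Figure 5 caption can be written out as a small function. This is a sketch for illustration only; the function name `figure5_colors` and its argument layout are hypothetical. It also assumes a single correct keypoint label per point, so it does not account for the equivalent labels that arise on rotationally symmetric objects (as noted for Figure 1).

```python
import numpy as np


def figure5_colors(gt_on_object, pred_on_object, gt_keypoint, pred_keypoint):
    """Assign one color name per scene point, following the Figure 5 caption."""
    gt = np.asarray(gt_on_object, dtype=bool)
    pred = np.asarray(pred_on_object, dtype=bool)
    gt_kp = np.asarray(gt_keypoint)
    pred_kp = np.asarray(pred_keypoint)

    colors = np.full(gt.shape, "white", dtype=object)  # correctly predicted background
    colors[gt & ~pred] = "red"                          # false negatives
    colors[~gt & pred] = "yellow"                       # false positives
    hit = gt & pred                                     # correct segmentation
    colors[hit & (gt_kp != pred_kp)] = "blue"           # ... but wrong keypoint
    colors[hit & (gt_kp == pred_kp)] = "green"          # ... and correct keypoint
    return colors
```

Counting the entries per color gives a quick summary of the same information, for example the share of "green" points among all points that truly belong to the object.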