GMatch: A Lightweight, Geometry-Constrained Keypoint Matcher for Zero-Shot 6DoF Pose Estimation in Robotic Grasp Tasks

Ming Yang, Haoran Li

TL;DR

GMatch tackles the challenge of zero-shot 6DoF pose estimation on resource-constrained robots by revisiting keypoint matching and introducing a geometry-constrained incremental matcher. It formulates correspondence as a branch-and-bound search that enforces geometric completeness via pairwise distances and scalar triple products, augmented with an opacity constraint to prevent flip-overs, and operates with a tunable feature-distance threshold $\epsilon_f$ and geometry tolerance $\epsilon_c$. Across HOPE and YCB-Video, GMatch coupled with SIFT achieves competitive accuracy, outperforming several feature-based and registration baselines and approaching state-of-the-art zero-shot methods on texture-rich objects, while running efficiently on CPU-only hardware. A real-world LoCoBot grasp demonstration validates its practicality, illustrating a lightweight, white-box solution that remains flexible to descriptor choice and potentially scalable with improved descriptors. The work highlights a practical direction for robust yet efficient pose estimation in embedded robotic systems, with future work targeting descriptor quality and additional geometric constraints to broaden robustness.
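The geometric checks named in the summary (pairwise distances and scalar triple products) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the volume tolerance `eps_v`, and the brute-force loops over all pairs and quadruples are assumptions made for illustration; only the distance tolerance corresponds to the $\epsilon_c$ mentioned above.

```python
import math
from itertools import combinations


def triple_product(a, b, c, d):
    """Scalar triple product (b - a) . ((c - a) x (d - a))."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    cross = (
        v[1] * w[2] - v[2] * w[1],
        v[2] * w[0] - v[0] * w[2],
        v[0] * w[1] - v[1] * w[0],
    )
    return sum(u[i] * cross[i] for i in range(3))


def is_geometrically_consistent(src, dst, eps_c=0.01, eps_v=1e-6):
    """Check a hypothesis in which src[i] is matched to dst[i].

    eps_c: tolerance on pairwise distance differences.
    eps_v: hypothetical threshold below which a triple product is
           treated as degenerate (near-coplanar points) and skipped.
    """
    n = len(src)
    # Pairwise distances must agree up to eps_c.
    for i, j in combinations(range(n), 2):
        d_src = math.dist(src[i], src[j])
        d_dst = math.dist(dst[i], dst[j])
        if abs(d_src - d_dst) > eps_c:
            return False
    # Triple products must not flip sign, which rules out mirror images.
    for i, j, k, l in combinations(range(n), 4):
        t_src = triple_product(src[i], src[j], src[k], src[l])
        t_dst = triple_product(dst[i], dst[j], dst[k], dst[l])
        if min(abs(t_src), abs(t_dst)) > eps_v and t_src * t_dst < 0:
            return False
    return True


src = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
# src rotated 90 degrees about the z-axis, then translated by (1, 2, 3).
rotated = [(1, 2, 3), (1, 3, 3), (0, 2, 3), (1, 2, 4)]
# src mirrored through the xy-plane: distances match, orientation does not.
mirrored = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, -1)]

print(is_geometrically_consistent(src, rotated))
print(is_geometrically_consistent(src, mirrored))
```

The mirrored set passes the distance test but fails the triple-product test, which is why distances alone do not pin down a rigid motion.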

Abstract

6DoF object pose estimation is fundamental to robotic grasp tasks. While recent learning-based methods achieve high accuracy, their computational demands hinder deployment on resource-constrained mobile platforms. In this work, we revisit the classical keypoint matching paradigm and propose GMatch, a lightweight, geometry-constrained keypoint matcher that runs efficiently on embedded CPU-only platforms. GMatch works with keypoint descriptors and uses a set of geometric constraints to resolve the inherent ambiguities between features extracted by descriptors, yielding globally consistent correspondences from which the 6DoF pose can be easily solved. We benchmark GMatch on the HOPE and YCB-Video datasets, where our method beats existing keypoint matchers (both feature-based and geometry-based) across three commonly used descriptors and approaches the SOTA zero-shot method on texture-rich objects while running on far more modest hardware. The method is further deployed on a LoCoBot mobile manipulator, enabling a one-shot grasp pipeline that demonstrates high task success rates in real-world experiments. In short, thanks to its lightweight and white-box nature, GMatch offers a practical solution for resource-limited robotic systems. Although it is currently bottlenecked by descriptor quality, the framework presents a promising direction towards robust yet efficient pose estimation. Code will be released soon under the Mozilla Public License.

Paper Structure

This paper contains 19 sections, 2 theorems, 23 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

Given two ordered point sets $\{\mathbf{x}_i\}_{i=1}^n, \{\mathbf{y}_i\}_{i=1}^n \subset \mathbb{R}^3$ satisfying $\lVert \mathbf{x}_i - \mathbf{x}_j \rVert = \lVert \mathbf{y}_i - \mathbf{y}_j \rVert, \quad \forall i,j = 1,\dots,n$, there exists an orthogonal matrix $\mathbf{Q} \in \mathbb{R}^{3 \t

Figures (6)

  • Figure 1: Overview of the matching-based pose estimation pipeline. Given a set of RGB-D images (snapshots) rendered from the target CAD model as the source and a scene image (observation) as the target, the descriptor processes them independently to generate keypoints and feature vectors, which the keypoint matcher uses to infer correspondences. Afterwards, the Kabsch algorithm [kabsch1978discussion] or PnP [gao2003p3p] is used to solve the pose from 3D-3D or 2D-3D correspondences, respectively.
  • Figure 2: GMatch performs an incremental search (Step) over hypotheses generated by a branch-and-bound strategy and selects the matches with the maximum length as output. In the illustrated example with repetitive grape textures, three locally plausible candidate pairs are initially identified. GMatch filters out inconsistent pairs using geometric characteristics such as relative distance and scalar triple product, retaining only globally consistent correspondences.
  • Figure 3: Dense keypoints with similar features are extracted on flat or approximately flat text regions, which yields many plausible matches that lead to flip-over.
  • Figure 4: Qualitative comparison: rotation sensitivity for SuperPoint and weak detection repeatability for ORB; inaccurate matches for LightGlue and redundant cross matching for TEASER++.
  • Figure 5: Failure cases of GMatch-SIFT. SIFT detects few keypoints (left) or indistinguishable features (right) on texture-weak objects. The former leaves GMatch no candidate pairs, and the latter often yields plenty of plausible solutions with lower cost than the real one.
  • ...and 1 more figure
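The caption of Figure 1 names the Kabsch algorithm as the solver for 3D-3D correspondences. A minimal sketch of that standard algorithm is given below, assuming NumPy is available and that matched points are supplied row by row; the function name `kabsch` and the example data are illustrative and are not taken from the paper's code.

```python
import numpy as np


def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that dst ~= src @ R.T + t.

    src, dst: (n, 3) arrays of matched 3D points, n >= 3, not all collinear.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    c_src = src.mean(axis=0)
    c_dst = dst.mean(axis=0)
    # Cross-covariance of the centred point sets.
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    # Force det(R) = +1 so that the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t


# Example: recover a 90-degree rotation about z plus a translation.
R_true = np.array([[0.0, -1.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, 2.0, 3.0])
src_pts = np.array([[0.0, 0.0, 0.0],
                    [1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
dst_pts = src_pts @ R_true.T + t_true

R_est, t_est = kabsch(src_pts, dst_pts)
print(np.round(R_est, 6))
print(np.round(t_est, 6))
```

The determinant correction matters when correspondences are noisy or nearly planar, where an unconstrained SVD solution can otherwise return a mirror image.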

Theorems & Definitions (6)

  • Lemma 1: Satorras et al. satorras2021proof
  • proof
  • Proposition 1
  • proof
  • proof: Proof of Lemma 1
  • proof: Proof of Proposition 1