Pseudo-keypoint RKHS Learning for Self-supervised 6DoF Pose Estimation

Yangzheng Wu, Michael Greenspan

TL;DR

RKHSPose tackles the sim2real gap in 6DoF pose estimation by combining a self-supervised, keypoint-voting framework with a learnable RKHS Adapter. A main regressor produces radial keypoint votes, while the Adapter maps synthetic and real feature spaces into a shared RKHS and minimizes the domain distance via a trainable kernel, guided by pseudo-keypoints from real images. The approach delivers state-of-the-art results among self-supervised methods on LM, LMO, and YCB-Video, and stays within -11.3% to +0.2% of the top fully supervised results on six BOP core datasets, using only unlabeled real data. With a runtime of around 34 fps and no need for real ground-truth annotations, the method is a practical option for 6DoF PE in real-world robotics and vision systems.

Abstract

We address the simulation-to-real domain gap in six degree-of-freedom pose estimation (6DoF PE), and propose a novel self-supervised keypoint voting-based 6DoF PE framework, effectively narrowing this gap using a learnable kernel in RKHS. We formulate this domain gap as a distance in high-dimensional feature space, distinct from previous iterative matching methods. We propose an adapter network, which is pre-trained on purely synthetic data with synthetic ground truth poses, and which evolves the network parameters from this source synthetic domain to the target real domain. Importantly, the real data training only uses pseudo-poses estimated by pseudo-keypoints, and thereby requires no real ground truth data annotations. Our proposed method is called RKHSPose, and achieves state-of-the-art performance among self-supervised methods on three commonly used 6DoF PE datasets including LINEMOD (+4.2%), Occlusion LINEMOD (+2%), and YCB-Video (+3%). It also compares favorably to fully supervised methods on all six applicable BOP core datasets, achieving within -11.3% to +0.2% of the top fully supervised results.
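The idea of treating the domain gap as a distance in feature space can be illustrated with the maximum mean discrepancy (MMD), the quantity the Figure 2 caption names. The following is a minimal sketch, not the authors' implementation: it uses a fixed Gaussian (RBF) kernel as a stand-in for the paper's learnable kernel, and random vectors as a stand-in for network features. The names `rbf_kernel`, `mean_kernel`, `mmd_squared`, and `sample_features`, as well as the bandwidth and feature dimension, are assumptions made for illustration only.

```python
import math
import random


def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel between two feature vectors (fixed, not learned)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(a, b, bandwidth):
    """Average kernel value over all pairs drawn from sets a and b."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in a for y in b)
    return total / (len(a) * len(b))


def mmd_squared(syn_feats, real_feats, bandwidth=4.0):
    """Biased estimate of squared MMD between two feature sets."""
    return (
        mean_kernel(syn_feats, syn_feats, bandwidth)
        + mean_kernel(real_feats, real_feats, bandwidth)
        - 2.0 * mean_kernel(syn_feats, real_feats, bandwidth)
    )


def sample_features(rng, n, dim, mean):
    """Hypothetical stand-in for intermediate network features."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
syn = sample_features(rng, 100, 16, 0.0)           # "synthetic" features
real_same = sample_features(rng, 100, 16, 0.0)     # same distribution
real_shifted = sample_features(rng, 100, 16, 0.5)  # shifted distribution

gap_same = mmd_squared(syn, real_same)
gap_shifted = mmd_squared(syn, real_shifted)
print(f"same domain: {gap_same:.4f}, shifted domain: {gap_shifted:.4f}")
```

In this toy setting the shifted feature set yields a larger MMD than the matched one, which is the kind of signal a domain-adaptation loss can minimize. In RKHSPose the kernel itself is trainable and the features come from the regressor network, neither of which this sketch attempts to reproduce.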

Paper Structure

This paper contains 23 sections, 6 equations, 8 figures, 14 tables.

Figures (8)

  • Figure 1: RKHSPose adapts the network pretrained on synthetic data to real test scenes (left), by comparing network feature spaces with real image inputs (solid arrows), against those with syn/real image (right) inputs (dashed arrows). $M_r$ regresses radial quantities, $M_A$ is the Adapter network, and RKHS maps features into a higher dimensional space.
  • Figure 2: RKHSPose architecture. RKHSPose is first trained on synthetic labeled data (solid arrows), and then fine-tuned on alternating syn/real and (unlabeled) real images (dashed arrows). $M_A$ is measured by MMD in RKHS by densely mapping the intermediate features of $M_r$ into high-dimensional spaces with conv blocks. The distance is treated as $\mathcal{L}_{M_A}$ and back-propagated through $M_A$ and $M_r$.
  • Figure 3: Qualitative overlay results on selected images. Red dots and blue dots are projected surface points from GT poses and estimated poses, respectively.
  • Figure 4: Impact of the number of real images, with and without GT labels, used during training. All datasets are evaluated by the BOP AR metric. We conduct experiments with 0 to 640 real images on all datasets except ITODD, which contains only 357 real images.
  • Figure S.5: Convolutional RKHS Adapter $M_A$ detailed structure.
  • ...and 3 more figures