Pseudo-keypoint RKHS Learning for Self-supervised 6DoF Pose Estimation
Yangzheng Wu, Michael Greenspan
TL;DR
RKHSPose tackles the sim2real gap in 6DoF pose estimation by combining a self-supervised keypoint-voting framework with a learnable RKHS Adapter. A main regressor produces radial keypoint votes, while the Adapter maps synthetic and real features into a shared RKHS and minimizes the distance between the two domains via a trainable kernel, guided by pseudo-keypoints estimated on real images. The approach delivers state-of-the-art results among self-supervised methods on LM, LMO, and YCB-Video, and stays within -11.3% to +0.2% of the top fully supervised results on six BOP core datasets, using only unlabeled real data. Running at around 34 fps and requiring no real ground-truth annotations, the method is practical for 6DoF PE in real-world robotics and vision systems.
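To make the idea of a domain distance in an RKHS concrete, the sketch below computes a squared Maximum Mean Discrepancy (MMD) between two batches of feature vectors using a Gaussian kernel. This is an illustrative assumption rather than the authors' Adapter: the choice of MMD, the Gaussian kernel, and the names `rbf_kernel`, `mmd_squared`, and `log_bandwidth` are all hypothetical. In a trainable setting, `log_bandwidth` would be a learned parameter instead of a fixed argument.

```python
import numpy as np


def rbf_kernel(a, b, log_bandwidth):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).

    sigma = exp(log_bandwidth) keeps the bandwidth positive; in a learnable
    kernel this scalar would be optimized (assumption for illustration).
    """
    sigma = np.exp(log_bandwidth)
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * sigma**2))


def mmd_squared(synthetic, real, log_bandwidth=0.0):
    """Biased estimate of squared MMD between two feature batches.

    synthetic: (n, d) array of features from the synthetic domain.
    real:      (m, d) array of features from the real domain.
    """
    k_ss = rbf_kernel(synthetic, synthetic, log_bandwidth)
    k_rr = rbf_kernel(real, real, log_bandwidth)
    k_sr = rbf_kernel(synthetic, real, log_bandwidth)
    return k_ss.mean() + k_rr.mean() - 2.0 * k_sr.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    synthetic_features = rng.normal(0.0, 1.0, size=(64, 8))
    real_features = rng.normal(1.5, 1.0, size=(64, 8))  # shifted domain
    print(mmd_squared(synthetic_features, real_features, np.log(4.0)))
```

Minimizing such a quantity with respect to the feature extractor pulls the two feature distributions together, which is the general mechanism the summary above describes; the actual kernel and loss used by RKHSPose may differ.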
Abstract
We address the simulation-to-real domain gap in six degree-of-freedom pose estimation (6DoF PE), and propose a novel self-supervised keypoint voting-based 6DoF PE framework, effectively narrowing this gap using a learnable kernel in RKHS. We formulate this domain gap as a distance in high-dimensional feature space, distinct from previous iterative matching methods. We propose an adapter network, which is pre-trained on purely synthetic data with synthetic ground truth poses, and which evolves the network parameters from this source synthetic domain to the target real domain. Importantly, the real data training only uses pseudo-poses estimated by pseudo-keypoints, and thereby requires no real ground truth data annotations. Our proposed method is called RKHSPose, and achieves state-of-the-art performance among self-supervised methods on three commonly used 6DoF PE datasets including LINEMOD (+4.2%), Occlusion LINEMOD (+2%), and YCB-Video (+3%). It also compares favorably to fully supervised methods on all six applicable BOP core datasets, achieving within -11.3% to +0.2% of the top fully supervised results.
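As an illustration of how a pseudo-pose can be recovered from pseudo-keypoints, the sketch below fits a rigid transform to matched 3D keypoints with the standard SVD-based (Kabsch) least-squares solution. It assumes the estimated keypoints are available as 3D points in the camera frame and are matched one-to-one with keypoints defined on the object model; neither assumption is stated in the abstract, and the function name `rigid_transform_from_keypoints` is hypothetical.

```python
import numpy as np


def rigid_transform_from_keypoints(model_pts, scene_pts):
    """Least-squares rotation R and translation t with scene ~= R @ model + t.

    model_pts: (N, 3) keypoints in the object frame.
    scene_pts: (N, 3) matched keypoints in the camera frame
               (here standing in for pseudo-keypoints; an assumption).
    Requires N >= 3 non-collinear points.
    """
    model_centroid = model_pts.mean(axis=0)
    scene_centroid = scene_pts.mean(axis=0)

    # 3x3 cross-covariance of the centered point sets.
    h = (model_pts - model_centroid).T @ (scene_pts - scene_centroid)
    u, _, vt = np.linalg.svd(h)

    # Flip the last axis if needed so the result is a proper rotation
    # (determinant +1) rather than a reflection.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = scene_centroid - rotation @ model_centroid
    return rotation, translation


if __name__ == "__main__":
    angle = np.deg2rad(30.0)
    true_r = np.array([
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle), np.cos(angle), 0.0],
        [0.0, 0.0, 1.0],
    ])
    true_t = np.array([0.1, -0.2, 0.8])
    model = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                      [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]])
    scene = model @ true_r.T + true_t
    est_r, est_t = rigid_transform_from_keypoints(model, scene)
    print(np.allclose(est_r, true_r), np.allclose(est_t, true_t))
```

A pose obtained this way from network-predicted keypoints carries no human annotation, which is what allows it to serve as a training signal on unlabeled real images.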
