Structure-Aware Correspondence Learning for Relative Pose Estimation
Yihan Chen, Wenfei Yang, Huan Ren, Shifeng Zhang, Tianzhu Zhang, Feng Wu
TL;DR
This work tackles relative object pose estimation for unseen categories by proposing Structure-Aware Correspondence Learning, which jointly learns structure-aware keypoints and structure-aware features to estimate 3D-3D correspondences without explicit feature matching. The method uses learnable keypoint queries supervised by image reconstruction, intra- and inter-image attention with RoPE-based encoding, and 3D coordinate regression for the keypoints, followed by a weighted SVD that recovers the relative rotation. Across CO3D, Objaverse, and LineMOD, it achieves state-of-the-art results, notably reducing mean angular error by about $6^{\circ}$ on CO3D and improving angular-accuracy metrics by substantial margins, which demonstrates strong generalization to unseen objects. The approach offers a practical path toward object-agnostic pose estimation by exploiting object structure rather than dense 2D or 3D feature matching, with implications for AR, robotics, and autonomous systems.
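The weighted SVD step mentioned above can be illustrated with the standard weighted Kabsch (orthogonal Procrustes) solution. This is a minimal sketch, not the authors' implementation: the function name `weighted_kabsch_rotation` and the inputs `src`, `dst`, and `weights` are hypothetical, and it assumes the two keypoint sets are already in one-to-one correspondence with a non-negative confidence weight per keypoint.

```python
import numpy as np

def weighted_kabsch_rotation(src, dst, weights):
    """Rotation R minimizing sum_i w_i * ||R @ src_i - dst_i||^2
    after weighted centering (illustrative helper, hypothetical name)."""
    src = np.asarray(src, dtype=float)      # (N, 3) keypoints in view 1
    dst = np.asarray(dst, dtype=float)      # (N, 3) keypoints in view 2
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                         # normalize weights to sum to 1

    # Remove the weighted centroids so only the rotation remains.
    src_c = src - w @ src
    dst_c = dst - w @ dst

    # Weighted cross-covariance: H = sum_i w_i * src_i * dst_i^T
    H = (src_c * w[:, None]).T @ dst_c
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that det(R) = +1 (no reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
```

Given predicted 3D keypoint coordinates for the two images and per-keypoint weights, calling `weighted_kabsch_rotation(src, dst, weights)` returns a 3x3 rotation matrix. Low-weight keypoints contribute less to the cross-covariance, which is why such a weighting can down-weight unreliable correspondences.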
Abstract
Relative pose estimation offers a promising route to object-agnostic pose estimation. Despite the success of existing 3D correspondence-based methods, their reliance on explicit feature matching suffers from small overlaps in visible regions and unreliable feature estimation for invisible regions. Inspired by humans' ability to assemble two object parts that have little or no overlap by considering object structure, we propose a novel Structure-Aware Correspondence Learning method for Relative Pose Estimation, which consists of two key modules. First, a structure-aware keypoint extraction module is designed to locate a set of keypoints that can represent the structure of objects with different shapes and appearances, under the guidance of a keypoint-based image reconstruction loss. Second, a structure-aware correspondence estimation module is designed to model the intra-image and inter-image relationships between keypoints in order to extract structure-aware features for correspondence estimation. By jointly leveraging these two modules, the proposed method can naturally estimate 3D-3D correspondences for unseen objects without explicit feature matching, enabling precise relative pose estimation. Experimental results on the CO3D, Objaverse, and LineMOD datasets demonstrate that the proposed method significantly outperforms prior methods, e.g., with a 5.7° reduction in mean angular error on the CO3D dataset.
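For readers unfamiliar with the angular error metric, a common definition is the geodesic angle between the predicted and ground-truth rotations. The sketch below assumes that definition; the source does not spell out its exact formula, and the function name `rotation_angle_deg` is hypothetical.

```python
import numpy as np

def rotation_angle_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices
    (assumed metric definition, illustrative helper)."""
    R_pred = np.asarray(R_pred, dtype=float)
    R_gt = np.asarray(R_gt, dtype=float)
    # trace(R_pred^T R_gt) = 1 + 2*cos(theta) for the residual rotation.
    cos_theta = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against round-off pushing the value outside [-1, 1].
    cos_theta = np.clip(cos_theta, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))
```

Under this assumption, the mean angular error is the average of `rotation_angle_deg` over all evaluated image pairs, and an angular-accuracy metric would count the fraction of pairs whose error falls below a chosen threshold.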
