LocPoseNet: Robust Location Prior for Unseen Object Pose Estimation
Chen Zhao, Yinlin Hu, Mathieu Salzmann
TL;DR
LocPoseNet addresses 6D pose estimation for unseen objects by learning a robust location prior from RGB images: the 2D object center $\mathbf{c}_q$ and the object size $s_q$. It follows a Siamese template-matching paradigm in which reference features act as convolution kernels, and a kernel-distribution mechanism produces multi-scale correlations without multiple backbone passes. A decoupled translation estimator then predicts $\mathbf{c}_q$ and $s_q$, using a deterministic fusion for center estimation and a scale-aware embedding for size prediction, and computes the translation as $\mathbf{T}_q=\frac{\tilde{f}s_{3d}}{s_q}\mathbf{K}^{-1}\hat{\mathbf{c}}_q$. The method outperforms prior work on unseen-object localization and 6D pose estimation on LINEMOD and GenMOP, and a challenging synthetic dataset is used to show its robustness to noise, underscoring the value of accurate location priors in generalizable pose estimation pipelines.
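To make the translation formula concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a zero-skew pinhole camera, so that $\mathbf{K}^{-1}\hat{\mathbf{c}}_q$ can be written in closed form, reads $\hat{\mathbf{c}}_q$ as the center in homogeneous pixel coordinates, $s_{3d}$ as the physical object size (for example its diameter), and $\tilde{f}$ as a single focal length taken here as the mean of `fx` and `fy`. These readings, and the function name `translation_from_location_prior`, are assumptions for illustration and are not stated in the summary above.

```python
from typing import Tuple


def translation_from_location_prior(
    center: Tuple[float, float],  # predicted 2D center c_q, in pixels
    size_2d: float,               # predicted 2D object size s_q, in pixels
    size_3d: float,               # physical object size s_3d (assumed: diameter)
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Tuple[float, float, float]:
    """Evaluate T_q = (f * s_3d / s_q) * K^{-1} * c_hat for a zero-skew pinhole camera."""
    if size_2d <= 0:
        raise ValueError("size_2d must be positive")

    u, v = center
    # K^{-1} applied to the homogeneous center [u, v, 1]^T (zero skew assumed).
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)

    # Assumption: f-tilde is approximated by the mean focal length.
    focal = 0.5 * (fx + fy)

    # Similar triangles: an object of size s_3d that appears s_q pixels wide
    # lies at depth f * s_3d / s_q.
    depth = focal * size_3d / size_2d

    return (depth * ray[0], depth * ray[1], depth * ray[2])
```

Under these assumptions, the size prediction fixes the depth and the center prediction fixes the viewing ray, so errors in $s_q$ scale the whole translation while errors in $\mathbf{c}_q$ shift it laterally.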
Abstract
An object location prior is critical in the standard 6D object pose estimation setting: it is used to initialize the 3D object translation and to facilitate 3D object rotation estimation. Unfortunately, the object detectors used for this purpose do not generalize to unseen objects. Existing 6D pose estimation methods for unseen objects therefore either assume that the ground-truth object location is known or yield inaccurate results when it is unavailable. In this paper, we address this problem with LocPoseNet, a method that robustly learns a location prior for unseen objects. Our method builds on a template-matching strategy in which we distribute the reference kernels and convolve them with a query to efficiently compute multi-scale correlations. We then introduce a novel translation estimator that decouples scale-aware and scale-robust features to predict different object location parameters. Our method outperforms existing works by a large margin on LINEMOD and GenMOP. We further construct a challenging synthetic dataset, which allows us to highlight the greater robustness of our method to various noise sources. Our project website is at: https://sailor-z.github.io/projects/3DV2024_LocPoseNet.html.
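The template-matching idea in the abstract, where a reference acts as a kernel that is correlated with the query at several scales, can be illustrated with a toy single-channel sketch. This is not the paper's kernel-distribution mechanism, whose details are not given here. As a stand-in, the sketch resizes one reference kernel to several sizes with nearest-neighbour sampling and divides each response by the kernel's L2 norm so that scores are comparable across scales. All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Grid = List[List[float]]


def cross_correlate(query: Grid, kernel: Grid) -> Grid:
    """Valid (unpadded) 2D cross-correlation of a query map with a kernel."""
    qh, qw = len(query), len(query[0])
    kh, kw = len(kernel), len(kernel[0])
    out: Grid = []
    for i in range(qh - kh + 1):
        row = []
        for j in range(qw - kw + 1):
            row.append(sum(query[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


def resize_nearest(kernel: Grid, size: int) -> Grid:
    """Resize a kernel to size x size with nearest-neighbour sampling."""
    kh, kw = len(kernel), len(kernel[0])
    return [[kernel[i * kh // size][j * kw // size] for j in range(size)]
            for i in range(size)]


def multi_scale_peak(
    query: Grid, reference: Grid, sizes: Sequence[int]
) -> Optional[Tuple[int, Tuple[int, int], float]]:
    """Return (kernel size, (row, col) of the window center, score) of the best match."""
    best: Optional[Tuple[int, Tuple[int, int], float]] = None
    for size in sizes:
        kernel = resize_nearest(reference, size)
        norm = math.sqrt(sum(v * v for row in kernel for v in row))
        if norm == 0.0:
            continue  # an all-zero kernel carries no information
        for i, row in enumerate(cross_correlate(query, kernel)):
            for j, value in enumerate(row):
                score = value / norm
                if best is None or score > best[2]:
                    # Window center = top-left corner + half the kernel size.
                    best = (size, (i + size // 2, j + size // 2), score)
    return best
```

In this toy setting, the location of the strongest response plays the role of the 2D center and the winning kernel size plays the role of the object size. A learned model would correlate deep feature maps rather than raw values, and the abstract indicates that the multi-scale correlations are obtained efficiently rather than by the exhaustive loop used here.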
