
LocPoseNet: Robust Location Prior for Unseen Object Pose Estimation

Chen Zhao, Yinlin Hu, Mathieu Salzmann

TL;DR

LocPoseNet addresses unseen-object 6D pose estimation by learning a robust location prior consisting of a 2D center $\mathbf{c}_q$ and a size $s_q$ from RGB imagery. It uses a Siamese template-matching paradigm where reference features serve as convolution kernels to generate multi-scale correlations through a novel kernel-distribution mechanism, coupled with a decoupled translation estimator that predicts $\mathbf{c}_q$ and $s_q$ and computes the translation $\mathbf{T}_q$ via $\mathbf{T}_q=\frac{\tilde{f}s_{3d}}{s_q}\mathbf{K}^{-1}\hat{\mathbf{c}}_q$. Key contributions include efficient, multi-scale correlation estimation without multiple backbone passes, a deterministic fusion for center estimation, and a scale-aware embedding for size prediction, all validated on LINEMOD, GenMOP, and a challenging synthetic dataset to demonstrate improved unseen-object localization and 6D pose estimation. The approach yields state-of-the-art localization accuracy for unseen objects and robust 6D pose performance, highlighting the practical impact of accurate location priors in generalizable pose estimation pipelines.
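The translation formula above can be illustrated with a short numeric sketch. This is not the authors' implementation: the function name and argument names are hypothetical, the intrinsic matrix $\mathbf{K}$ is assumed to be a zero-skew pinhole matrix (so $\mathbf{K}^{-1}\hat{\mathbf{c}}_q$ has a closed form), $\hat{\mathbf{c}}_q$ is assumed to be the homogeneous center $[u, v, 1]^\top$, and `f_tilde` simply stands for the $\tilde{f}$ term, whose exact definition is not given in this summary.

```python
def translation_from_location_prior(center, size_2d, size_3d, f_tilde, fx, fy, cx, cy):
    """Evaluate T_q = (f_tilde * s_3d / s_q) * K^{-1} * c_hat_q.

    Assumptions (not taken from the paper): K is a zero-skew pinhole
    matrix with focal lengths (fx, fy) and principal point (cx, cy),
    and c_hat_q is the homogeneous 2D center [u, v, 1].
    """
    if size_2d <= 0:
        raise ValueError("the predicted 2D size s_q must be positive")
    u, v = center
    # K^{-1} [u, v, 1]^T for a zero-skew K: the viewing ray through the center.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Scalar factor f_tilde * s_3d / s_q from the formula.
    scale = f_tilde * size_3d / size_2d
    return tuple(scale * r for r in ray)


# Hypothetical values: a 0.2 m object that appears 120 px wide.
T_q = translation_from_location_prior(
    center=(380.0, 240.0), size_2d=120.0, size_3d=0.2,
    f_tilde=600.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0,
)
```

Under these assumptions, a smaller predicted size $s_q$ pushes the object farther along the viewing ray, while an error in $\mathbf{c}_q$ shifts the lateral components of $\mathbf{T}_q$, which is why both parts of the location prior matter for the translation.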

Abstract

Object location prior is critical for the standard 6D object pose estimation setting. The prior can be used to initialize the 3D object translation and facilitate 3D object rotation estimation. Unfortunately, the object detectors that are used for this purpose do not generalize to unseen objects. Therefore, existing 6D pose estimation methods for unseen objects either assume the ground-truth object location to be known or yield inaccurate results when it is unavailable. In this paper, we address this problem by developing a method, LocPoseNet, able to robustly learn location prior for unseen objects. Our method builds upon a template matching strategy, where we propose to distribute the reference kernels and convolve them with a query to efficiently compute multi-scale correlations. We then introduce a novel translation estimator, which decouples scale-aware and scale-robust features to predict different object location parameters. Our method outperforms existing works by a large margin on LINEMOD and GenMOP. We further construct a challenging synthetic dataset, which allows us to highlight the better robustness of our method to various noise sources. Our project website is at: https://sailor-z.github.io/projects/3DV2024_LocPoseNet.html.

Paper Structure (13 sections, 9 equations, 9 figures, 6 tables)


Figures (9)

  • Figure 1: Location prior for unseen object pose estimation. We tackle the case where the unseen object comes from a category (cat) that was not included in the training data. An accurate location prior greatly facilitates 6D object pose estimation: Under a pinhole camera model, it provides valuable information about object translation. Furthermore, it facilitates rotation prediction by cropping the object, thus limiting the influence of the background.
  • Figure 2: Limitations of previous methods. Multi-scale correlations are computed by passing the query image at $n$ different resolutions through the feature extraction backbone, which is inefficient. Moreover, it yields noisy correlation maps with incorrect high responses (red circles), which in turn interfere with the prediction of the object center $\mathbf{c}_q$.
  • Figure 3: Network architecture. Our network takes a query and a set of references as input. The feature extraction backbone is shared by the query and references. We efficiently capture multi-scale correlations with adjustable receptive fields over the query image (indicated by different colors). The correlations are fed into the presented estimator, where the scale-robust and scale-aware features are separately learned. We predict the object location parameters utilizing cross-reference consistencies, and then compute 3D object translation by using the translation equation $\mathbf{T}_q=\frac{\tilde{f}s_{3d}}{s_q}\mathbf{K}^{-1}\hat{\mathbf{c}}_q$.
  • Figure 4: Efficient multi-scale correlation estimation module. A reference kernel is distributed at different spatial sizes. The query feature map $\mathbf{F}^{q}$ is then convolved with all the distributed kernels. The resulting $(c_1,c_2,c_3)$ capture information from the query with different receptive fields, as indicated by the colored boxes.
  • Figure 5: Illustrations of the correlation maps and the deterministic feature fusion. (a) shows two correlation maps; the bottom one, which has multiple peaks, is unreliable for computing the object center. To mitigate the impact of such noisy correlation maps, we present a fusion approach where smaller weights are assigned to the noisy maps in a deterministic manner. The blue curve represents the computed weights.
  • ...and 4 more figures
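
The idea in the Figure 4 caption, one reference kernel covering several receptive fields over a single query feature map, can be sketched in plain Python. This is only an illustration under stated assumptions: "distributing" the kernel is interpreted here as spreading its taps farther apart (as in a dilated convolution), which may differ from the paper's actual mechanism. The function names, the single-channel 2D maps, and the `spacings` values are all hypothetical.

```python
def correlate(query, kernel, spacing=1):
    """Valid cross-correlation of a 2D query map with a 2D kernel whose
    taps are placed `spacing` cells apart (spacing=1 is a plain correlation).

    Assumption: tap spacing stands in for the paper's kernel distribution.
    """
    kh, kw = len(kernel), len(kernel[0])
    # Footprint of the spread-out kernel on the query (its receptive field).
    span_h = (kh - 1) * spacing + 1
    span_w = (kw - 1) * spacing + 1
    out_h = len(query) - span_h + 1
    out_w = len(query[0]) - span_w + 1
    if out_h <= 0 or out_w <= 0:
        raise ValueError("kernel footprint exceeds the query size")
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    acc += kernel[a][b] * query[i + a * spacing][j + b * spacing]
            row.append(acc)
        out.append(row)
    return out


def multi_scale_correlations(query, kernel, spacings=(1, 2, 3)):
    """One correlation map per spacing, all computed from the same query map."""
    return [correlate(query, kernel, s) for s in spacings]


# Toy data: a 7x7 query feature map and a 2x2 reference kernel.
query = [[1.0] * 7 for _ in range(7)]
kernel = [[1.0, 1.0], [1.0, 1.0]]
c1, c2, c3 = multi_scale_correlations(query, kernel)
```

The point of the sketch is that the query features are computed once and reused for every scale, in contrast to the approach in Figure 2, where the query image passes through the backbone at $n$ resolutions.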