Leveraging Positional Encoding for Robust Multi-Reference-Based Object 6D Pose Estimation

Jaewoo Park, Jaeguk Kim, Nam Ik Cho

TL;DR

This work tackles the challenges of monocular 6D pose estimation by addressing blurry geometric representations and refinement-stage local minima. It introduces Multi-Reference Pose Encoding (MRPE), which combines high-frequency positional encoding of 3D coordinates with an intrinsic-matrix–untangled render-and-compare refinement across multiple offline references, followed by a fast single-reference fine-tuning. A geometry feature extractor enriched by AdaIN conditioning and an occlusion augmentation strategy enhances robustness against occlusion and object class confusion. Empirical results on Linemod, Linemod-Occlusion, and YCB-Video demonstrate state-of-the-art performance, including in mesh-less settings, highlighting the practical impact for robust object pose estimation in real-world robotics and AR tasks.
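The high-frequency positional encoding mentioned above can be illustrated with a minimal sketch. It assumes a NeRF-style sinusoidal encoding applied to object coordinates normalized to [-1, 1]. The function name `positional_encoding` and the `num_bands` parameter are hypothetical, and the paper's exact formulation may differ.

```python
import math


def positional_encoding(point, num_bands=4):
    """Encode a 3D point with sin/cos pairs at frequencies 2^k * pi.

    Assumption: `point` holds object coordinates normalized to [-1, 1].
    The output is a flat list of length 3 * 2 * num_bands, ordered by
    frequency band, then by coordinate, then (sin, cos).
    """
    encoded = []
    for k in range(num_bands):
        freq = (2.0 ** k) * math.pi  # frequency doubles with each band
        for coord in point:
            encoded.append(math.sin(freq * coord))
            encoded.append(math.cos(freq * coord))
    return encoded


if __name__ == "__main__":
    # Two nearby surface points: the low-frequency band barely separates
    # them, while the highest band separates them much more clearly.
    a = positional_encoding((0.10, 0.0, 0.0))
    b = positional_encoding((0.11, 0.0, 0.0))
    print("band 0 gap:", abs(a[0] - b[0]))
    print("band 3 gap:", abs(a[18] - b[18]))
```

In this sketch, nearby 3D points that are almost indistinguishable in the lowest band differ noticeably in the highest band, which is the intuition behind using high-frequency components to counter blurry coordinate regression.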

Abstract

Accurately estimating the pose of an object is a crucial task in computer vision and robotics. There are two main deep learning approaches for this: geometric representation regression and iterative refinement. However, these methods have some limitations that reduce their effectiveness. In this paper, we analyze these limitations and propose new strategies to overcome them. To tackle the issue of blurry geometric representation, we use positional encoding with high-frequency components for the object's 3D coordinates. To address the local minimum problem in refinement methods, we introduce a normalized image plane-based multi-reference refinement strategy that's independent of intrinsic matrix constraints. Lastly, we utilize adaptive instance normalization and a simple occlusion augmentation method to help our model concentrate on the target object. Our experiments on Linemod, Linemod-Occlusion, and YCB-Video datasets demonstrate that our approach outperforms existing methods. We will soon release the code.
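To see why working on the normalized image plane removes the dependence on the intrinsic matrix, consider the following minimal sketch. It assumes a pinhole camera with zero skew. The helper names and the camera parameters are hypothetical and are not taken from the paper.

```python
def project(point_cam, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates
    (pinhole model, zero skew assumed)."""
    x, y, z = point_cam
    return (fx * x / z + cx, fy * y / z + cy)


def to_normalized_plane(pixel, fx, fy, cx, cy):
    """Undo the intrinsics: map a pixel to the normalized image plane (z = 1)."""
    u, v = pixel
    return ((u - cx) / fx, (v - cy) / fy)


if __name__ == "__main__":
    point = (0.1, -0.05, 0.5)  # hypothetical 3D point in metres
    cam_a = dict(fx=600.0, fy=600.0, cx=320.0, cy=240.0)   # hypothetical camera A
    cam_b = dict(fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)  # hypothetical camera B

    pix_a = project(point, **cam_a)
    pix_b = project(point, **cam_b)
    print("pixels:", pix_a, pix_b)  # these differ between the two cameras

    # After normalization both cameras agree on (x/z, y/z).
    print("normalized:", to_normalized_plane(pix_a, **cam_a),
          to_normalized_plane(pix_b, **cam_b))
```

The same 3D point lands on different pixels under the two cameras, yet maps to the same normalized coordinates (x/z, y/z). A refinement step defined on that plane can therefore compare a query against references without being tied to one particular set of intrinsics.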

Paper Structure

This paper contains 12 sections, 7 equations, 7 figures, 7 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Key idea of our method: Our model estimates positional encoding, mask, and confidence of the query object and utilizes multiple references for reliable refinement estimations.
  • Figure 2: Failure cases in coordinate estimation: (a) Examples of blurry 2D-3D coordinate estimation. Red boxes mark ground truths and blue boxes mark predictions. (b) Inlier samples selected by RANSAC, which fails to filter out the blurry estimations.
  • Figure 3: Different patterns depending on the reference: We visualize final estimations and improvements of render-and-compare models for each test sample based on two different references.
  • Figure 4: Overview of our method: Our approach follows a four-step process. Firstly, we generate offline references. Then, for a query image, we estimate its geometric features and determine its relative pose with respect to the references. After that, we identify a reliable pose and refine it using the standard render-and-compare strategy.
  • Figure 5: Comparison with other SOTA methods: We compare our method with others: (a) our method, (b) ZebraPose [su2022zebrapose], (c) CIR [lipson2022coupled], (d) GDRNPP [wang2021gdr], and (e) PFA [hu2022perspective]. The projected contours derived from the label pose and the predicted pose are depicted in green and blue, respectively.
  • ...and 2 more figures