Learning Positive-Incentive Point Sampling in Neural Implicit Fields for Object Pose Estimation

Yifei Shi, Boyan Wan, Xin Xu, Kai Xu

TL;DR

A method that combines an SO(3)-equivariant convolutional implicit network with a positive-incentive point sampling strategy. It outperforms most existing baselines and shows significant improvements in challenging scenarios, such as objects captured with unseen poses, high occlusion, novel geometry, and severe noise.

Abstract

Learning neural implicit fields of 3D shapes is a rapidly emerging research area that enables shape representation at arbitrary resolutions. Owing to this flexibility, neural implicit fields have succeeded in many research areas, including shape reconstruction, novel view image synthesis, and, more recently, object pose estimation. Neural implicit fields enable learning dense correspondences between the camera space and the object's canonical space (including unobserved regions in camera space), significantly boosting object pose estimation performance in challenging scenarios such as highly occluded objects and novel shapes. Despite this progress, predicting canonical coordinates for unobserved camera-space regions remains challenging because direct observational signals are lacking. The prediction must therefore rely heavily on the model's generalization ability, which results in high uncertainty. Consequently, densely sampling points across the entire camera space may yield inaccurate estimations that hinder the learning process and compromise performance. To alleviate this problem, we propose a method combining an SO(3)-equivariant convolutional implicit network and a positive-incentive point sampling (PIPS) strategy. The SO(3)-equivariant convolutional implicit network estimates point-level attributes with SO(3)-equivariance at arbitrary query locations, demonstrating superior performance compared to most existing baselines. The PIPS strategy dynamically determines sampling locations based on the input, thereby boosting the network's accuracy and training efficiency. Our method outperforms the state of the art on three pose estimation datasets. Notably, it demonstrates significant improvements in challenging scenarios, such as objects captured with unseen poses, high occlusion, novel geometry, and severe noise.

Paper Structure (27 sections, 17 equations, 15 figures, 8 tables)

This paper contains 27 sections, 17 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: We propose PIPS, a data-driven approach that dynamically determines where to sample in order to improve network training, achieving better performance while training on fewer sample points than (a) the random sampling baseline. PIPS consists of two components: (b) positive-incentive point sampling with high estimation certainty (PIPS-C) and (c) positive-incentive point sampling with high geometric stability (PIPS-S).
  • Figure 2: (a) Point-wise canonical coordinate estimation on as few as three points (in red) is sufficient to determine all six degrees of freedom (6-DoF) of the object pose. (b) Extra voters with inaccurate point-level estimations (in blue) degrade the performance.
  • Figure 3: Quantitative comparison of the proposed PIPS-C and PIPS-S with the random sampling baseline. Our method reduces the number of sample points and the training time while achieving better performance in object pose estimation. The experiment is conducted on the NOCS-REAL275 dataset.
  • Figure 4: Overview of the proposed method. First, an SO(3)-equivariant convolutional implicit network with dense point sampling (the teacher model) is optimized to generate the pseudo ground-truth. Second, the PIPS-C and PIPS-S estimation networks (the student model) are trained based on the generated pseudo ground-truth. Third, an SO(3)-equivariant convolutional implicit network is trained with the sample points estimated by the PIPS estimation network.
  • Figure 5: (a) Rotating the 3D graph convolution kernel by the rotation group of a regular icosahedron yields a set of convolutional kernels. (b) Point cloud convolutions with the rotated kernels make the generated features SO(3)-invariant. Vector neurons are then generated from the SO(3)-invariant point feature and multiplied by the rotation matrix $R_a$ corresponding to the $q_a\in Q$ with the highest activation, which makes the feature SO(3)-equivariant.
  • ...and 10 more figures
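
The observation in the Figure 2 caption, that three point correspondences are enough to fix a 6-DoF pose and that inaccurate extra voters hurt, can be illustrated with a small sketch. It assumes a standard least-squares rigid alignment (the Kabsch algorithm) between canonical and camera-space points; the source does not state which pose solver the paper uses, and the function names, the synthetic pose, and the noise level of 0.05 are all hypothetical choices made for illustration.

```python
import numpy as np

def rigid_pose_from_correspondences(canonical_pts, camera_pts):
    """Least-squares R, t such that camera ~= R @ canonical + t (Kabsch).

    Illustrative helper, not the paper's solver.
    """
    P = np.asarray(canonical_pts, dtype=float)
    Q = np.asarray(camera_pts, dtype=float)
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t

def rotation_error_deg(R_est, R_true):
    """Geodesic angle between two rotations, in degrees."""
    cos_angle = (np.trace(R_est.T @ R_true) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# Synthetic ground-truth pose (assumed values), built with Rodrigues' formula.
axis = np.array([1.0, 2.0, 3.0]) / np.linalg.norm([1.0, 2.0, 3.0])
angle = 0.7
K = np.array([[0.0, -axis[2], axis[1]],
              [axis[2], 0.0, -axis[0]],
              [-axis[1], axis[0], 0.0]])
R_true = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
t_true = np.array([0.2, -0.1, 0.5])

# (a) Three accurate, non-collinear correspondences.
canon3 = np.array([[0.1, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.3]])
cam3 = canon3 @ R_true.T + t_true
R3, t3 = rigid_pose_from_correspondences(canon3, cam3)
err_clean = rotation_error_deg(R3, R_true)

# (b) The same three points plus 20 voters whose estimates are noisy.
rng = np.random.default_rng(0)
canon_extra = rng.uniform(-0.3, 0.3, size=(20, 3))
cam_extra = canon_extra @ R_true.T + t_true + rng.normal(0.0, 0.05, size=(20, 3))
R_all, t_all = rigid_pose_from_correspondences(
    np.vstack([canon3, canon_extra]), np.vstack([cam3, cam_extra]))
err_noisy = rotation_error_deg(R_all, R_true)

print(f"rotation error, 3 accurate points: {err_clean:.6f} deg")
print(f"rotation error, with noisy voters: {err_noisy:.6f} deg")
```

In this toy setup the three accurate correspondences recover the pose up to floating-point precision, while adding noisy voters moves the least-squares solution away from the true pose. This is only meant to convey the intuition behind selective sampling; it does not reproduce the PIPS-C or PIPS-S selection criteria.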