CPPF++: Uncertainty-Aware Sim2Real Object Pose Estimation by Vote Aggregation

Yang You; Wenhao He; Jin Liu; Hongkai Xiong; Weiming Wang; Cewu Lu

TL;DR

CPPF++ advances sim-to-real 6D pose estimation by recasting point-pair voting as a probabilistic process in canonical space, which lets it handle vote collisions. It introduces N-point tuples to enrich context, a noisy-pair filtering scheme, and an online alignment optimization that refines poses at inference, all trained purely on synthetic data without backgrounds. The approach substantially outperforms prior sim-to-real methods on NOCS REAL275, generalizes well to unseen datasets, and is competitive with methods trained on real data. The accompanying DiversePose 300 dataset provides a challenging category-level benchmark with diverse poses and backgrounds.

Abstract

Object pose estimation constitutes a critical area within the domain of 3D vision. While contemporary state-of-the-art methods that leverage real-world pose annotations have demonstrated commendable performance, the procurement of such real training data incurs substantial costs. This paper focuses on a specific setting wherein only 3D CAD models are utilized as a priori knowledge, devoid of any background or clutter information. We introduce a novel method, CPPF++, designed for sim-to-real pose estimation. This method builds upon the foundational point-pair voting scheme of CPPF, reformulating it through a probabilistic view. To address the challenge posed by vote collision, we propose a novel approach that involves modeling the voting uncertainty by estimating the probabilistic distribution of each point pair within the canonical space. Furthermore, we augment the contextual information provided by each voting unit through the introduction of N-point tuples. To enhance the robustness and accuracy of the model, we incorporate several innovative modules, including noisy pair filtering, online alignment optimization, and a tuple feature ensemble. Alongside these methodological advancements, we introduce a new category-level pose estimation dataset, named DiversePose 300. Empirical evidence demonstrates that our method significantly surpasses previous sim-to-real approaches and achieves comparable or superior performance on novel datasets. Our code is available on https://github.com/qq456cvb/CPPF2.
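To make the vote-collision problem mentioned in the abstract concrete, here is a minimal toy sketch. It is not taken from the CPPF++ code: the function name `target_histogram`, the bin width, and the toy targets are all assumptions chosen for illustration. It only shows why a single deterministic prediction is a poor fit when indistinguishable inputs have different voting targets, while a distribution over discretized targets keeps both alternatives.

```python
from collections import Counter


def target_histogram(targets, bin_width):
    """Discretize targets and return {bin_index: probability} (hypothetical helper)."""
    counts = Counter(round(t / bin_width) for t in targets)
    total = sum(counts.values())
    return {b: n / total for b, n in sorted(counts.items())}


# Toy collision (assumed data): point pairs with indistinguishable input
# features whose true voting targets nevertheless differ.
colliding_targets = [-0.2, -0.2, -0.2, 0.2, 0.2, 0.2]

# A single regressed value trained with a squared loss tends toward the mean,
# which here lies between the two true targets and matches neither.
mean_prediction = sum(colliding_targets) / len(colliding_targets)

# A categorical distribution over bins keeps both modes.
distribution = target_histogram(colliding_targets, bin_width=0.1)
modes = [b for b, p in distribution.items() if p == max(distribution.values())]

print(mean_prediction, distribution, modes)
```

In this toy case the mean is approximately zero, a target that no pair actually has, whereas the histogram places equal mass on the two true bins.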

Paper Structure (43 sections, 12 equations, 13 figures, 8 tables)

Introduction
Introduction
Related Works
Pose Estimation Trained on Real-World Data
Pose Estimation Trained on Synthetic Data
Method
Preliminaries: Problem Setting and CPPF Voting
A Probabilistic Uncertainty Model of Point Pair Voting
$N$-Point Tuple Feature Extraction
Noisy Pair Filtering
Importance Sample Re-weighting
Online Alignment Optimization
Tuple Feature Ensemble
Instance Mask Prediction
DiversePose 300: A New Dataset with Diverse Poses and Backgrounds
...and 28 more sections

Figures (13)

Figure 1: Unlike previous methods, our method leverages synthetic RGB-D images without backgrounds for training. During inference, a collection of $N$-point tuples is uniformly sampled to vote for poses with uncertainty awareness, and the final prediction is taken as the majority vote.
Figure 2: Pipeline Overview. Our pipeline commences with a masked point cloud input, derived from an off-the-shelf instance segmentation model. Subsequently, point tuples are randomly sampled from the object. Features for each tuple are extracted and fed into a tuple encoder to obtain the tuple embedding. Following is the prediction of the canonical coordinate and scale of each tuple. During inference, the computed canonical coordinates and scales are utilized to vote for the object's center. To mitigate the influence of erroneous tuple samples, we introduce a noisy pair filtering module, enabling the orientation vote to be cast exclusively by reliable point tuples. Finally, an online alignment optimization step is employed to further refine the predicted rotation and translation, enhancing the accuracy of our model's output.
Figure 3: (a) Center voting mechanism of CPPF. Given a point pair $\mathbf{p}_1, \mathbf{p}_2$, CPPF predicts $\mu$ and $\nu$, with $\mathbf{c}$ representing the perpendicular foot. (b) Orientation voting mechanism of CPPF. For each point pair, CPPF estimates the angle $\alpha$ between $\mathbf{p}_2 - \mathbf{p}_1$ and $\mathbf{e}_1$.
Figure 4: This figure illustrates the phenomenon of voting collision for two specific objects: a snack box and a bowl. For each object, we compile a collection of input pairs characterized by similar input features. This similarity is determined by hashing the features, grouping all point pairs that fall into the same bin. Subsequently, we discretize and count the occurrences of various voting targets $\mu, \nu$ associated with these similar point pairs. The ideal scenario would be for all output voting targets to align, manifesting as a singular, prominent peak in the graph. Contrary to this ideal, the graph reveals multiple peaks, indicating that point pairs with similar input features often lead to significantly disparate voting targets. This variability underscores the difficulty in predicting a consistent deterministic relationship between input features and output voting targets, highlighting the inherent challenges posed by voting collision.
Figure 5: Contrast between Point Pairs and $N$-point Tuples. The dashed circle represents the local context surrounding each point. Left: The two point pairs, $\mathbf{p}_1,\mathbf{p}_2$ and $\mathbf{p}'_1,\mathbf{p}'_2$, possess identical relative coordinates and local contexts, rendering them indiscernible. Right: Conversely, by introducing an extra point, $\mathbf{p}_3$, and creating 3-point tuples using the initial point pairs, the resulting 3-point tuples can now be differentiated based on their relative coordinates.
...and 8 more figures
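The center-voting geometry described in the caption of Figure 3(a) can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes $\mu$ is the signed offset of the perpendicular foot $\mathbf{c}$ from $\mathbf{p}_1$ along the normalized direction $\mathbf{p}_2 - \mathbf{p}_1$, and $\nu$ is the distance from the object center to the line through the pair. The function names and the `num_bins` parameter are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def scale(a, s):
    return tuple(x * s for x in a)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def unit(a):
    return scale(a, 1.0 / norm(a))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def voting_targets(p1, p2, center):
    """Compute (mu, nu) for a pair from a known center (assumed definitions)."""
    d = unit(sub(p2, p1))
    mu = dot(sub(center, p1), d)      # signed offset of the foot along the pair
    foot = add(p1, scale(d, mu))      # perpendicular foot c
    nu = norm(sub(center, foot))      # distance from the center to the line
    return mu, nu


def candidate_centers(p1, p2, mu, nu, num_bins=8):
    """Sample centers on the circle of radius nu around the foot, normal to the pair."""
    d = unit(sub(p2, p1))
    foot = add(p1, scale(d, mu))
    # Any helper axis that is not parallel to d yields a basis (u, v) of the plane.
    helper = (1.0, 0.0, 0.0) if abs(d[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = unit(cross(d, helper))
    v = cross(d, u)
    candidates = []
    for k in range(num_bins):
        theta = 2.0 * math.pi * k / num_bins
        offset = add(scale(u, nu * math.cos(theta)), scale(v, nu * math.sin(theta)))
        candidates.append(add(foot, offset))
    return candidates


p1, p2, center = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 1.0, 0.0)
mu, nu = voting_targets(p1, p2, center)
candidates = candidate_centers(p1, p2, mu, nu)
closest = min(norm(sub(c, center)) for c in candidates)
print(mu, nu, closest)
```

Under these assumptions a single pair constrains the center only to a circle, so each pair casts several candidate votes. Circles from many pairs pass near the true center, which is why accumulating votes can produce a peak there.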
