Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking

Jiyao Zhang, Weiyao Huang, Bo Peng, Mingdong Wu, Fei Hu, Zijian Chen, Bo Zhao, Hao Dong

TL;DR

This paper introduces GenPose++, an enhanced version of the SOTA category-level pose estimation framework. It incorporates two pivotal improvements, semantic-aware feature extraction and clustering-based aggregation, to address the challenges of 6D object pose estimation and pose tracking.

Abstract

6D Object Pose Estimation is a crucial yet challenging task in computer vision, suffering from a significant lack of large-scale datasets. This scarcity impedes comprehensive evaluation of model performance, limiting research advancements. Furthermore, the restricted number of available instances or categories curtails its applications. To address these issues, this paper introduces Omni6DPose, a substantial dataset characterized by its diversity in object categories, large scale, and variety in object materials. Omni6DPose is divided into three main components: ROPE (Real 6D Object Pose Estimation Dataset), which includes 332K images annotated with over 1.5M annotations across 581 instances in 149 categories; SOPE (Simulated 6D Object Pose Estimation Dataset), consisting of 475K images created in a mixed reality setting with depth simulation, annotated with over 5M annotations across 4162 instances in the same 149 categories; and the manually aligned real scanned objects used in both ROPE and SOPE. Omni6DPose is inherently challenging due to its substantial variations and ambiguities. To address this challenge, we introduce GenPose++, an enhanced version of the SOTA category-level pose estimation framework, incorporating two pivotal improvements: semantic-aware feature extraction and clustering-based aggregation. Moreover, we provide a comprehensive benchmarking analysis to evaluate the performance of previous methods on this large-scale dataset in the realms of 6D object pose estimation and pose tracking.

Paper Structure (41 sections, 5 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: We introduce a universal 6D object pose estimation dataset, Omni6DPose. The middle section showcases some examples of canonically aligned objects from our dataset, with samples of SOPE depicted on the left and samples of ROPE on the right.
  • Figure 2: ROPE dataset visualization. In the figure, bounding boxes are colored according to the coordinates in the object's coordinate system.
  • Figure 3: ROPE dataset collection and annotation. (1) Object scanning, where high-precision industrial scanners are used to acquire the CAD models of objects; (2) Object canonicalization, involving the alignment of each object category to the canonical space; (3) Video capture, capturing video sequences in varied scenarios with a depth camera; and (4) Pose annotation, calculating camera poses through Structure from Motion (SFM), further utilizing Farthest Point Sampling (FPS) to select keyframes for keypoint annotation, and performing bundle adjustment to derive initial object pose values, which are then manually refined to obtain more precise annotations.
  • Figure 4: SOPE synthesis, utilizing mixed reality to bridge the RGB sim2real gap and physical-based depth sensor simulation to minimize the geometric sim2real gap.
  • Figure 5: Omni6DPose statistics, showcasing the dataset distribution. Left: Category distribution, highlighting 149 categories and diverse materials. Right: Object size distribution across 5000 objects, illustrating diversity in shapes.
  • ...and 8 more figures
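
The Figure 3 caption mentions Farthest Point Sampling (FPS) for choosing the keyframes that receive keypoint annotations. The following is a minimal sketch of greedy FPS, not the authors' implementation. It assumes that frames are compared by the Euclidean distance between camera centers recovered from SfM; the caption does not say which distance is used. The function name `farthest_point_sampling` and the `camera_centers` array are hypothetical and exist only for illustration.

```python
import numpy as np


def farthest_point_sampling(points: np.ndarray, k: int, start: int = 0) -> list[int]:
    """Greedy FPS: return indices of k points that are spread far apart.

    points: (n, d) array, e.g. one camera center per video frame (assumption).
    k:      number of points (keyframes) to select.
    start:  index of the first selected point.
    """
    n = points.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of points")

    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(points - points[start], axis=1)

    for _ in range(k - 1):
        # Pick the point that is farthest from the current selection.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        # Update nearest-selected distances with the newly chosen point.
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)

    return selected


# Hypothetical camera centers for five frames along a straight path.
camera_centers = np.array([
    [0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0],
    [0.2, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
])
keyframes = farthest_point_sampling(camera_centers, k=3)
print(keyframes)  # [0, 4, 3]
```

In this toy example the nearly identical frames 1 and 2 are skipped in favor of frames 4 and 3, which is the behavior that makes FPS a reasonable way to limit manual annotation to well-separated viewpoints.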