OV9D: Open-Vocabulary Category-Level 9D Object Pose and Size Estimation

Junhao Cai, Yisheng He, Weihao Yuan, Siyu Zhu, Zilong Dong, Liefeng Bo, Qifeng Chen

TL;DR

Comprehensive quantitative and qualitative experiments demonstrate that the proposed open-vocabulary method, trained on large-scale synthesized data, significantly outperforms the baseline and can effectively generalize to real-world images of unseen categories.

Abstract

This paper studies a new open-set problem: open-vocabulary category-level object pose and size estimation. Given human text descriptions of arbitrary novel object categories, the robot agent seeks to predict the position, orientation, and size of the target object in the observed scene image. To enable such generalizability, we first introduce OO3D-9D, a large-scale photorealistic dataset for this task. Derived from OmniObject3D, OO3D-9D is the largest and most diverse dataset in the field of category-level object pose and size estimation. It includes additional annotations for the symmetry axis of each category, which help resolve symmetry ambiguity. Beyond the large-scale dataset, we find that another key to such generalizability is leveraging the strong prior knowledge in pre-trained vision-language foundation models. We then propose a framework built on pre-trained DinoV2 and text-to-image Stable Diffusion models to infer the normalized object coordinate space (NOCS) maps of the target instances. This framework fully leverages the visual semantic prior from DinoV2 and the aligned visual and language knowledge within the text-to-image diffusion model, which enables generalization to various text descriptions of novel categories. Comprehensive quantitative and qualitative experiments demonstrate that the proposed open-vocabulary method, trained on our large-scale synthesized data, significantly outperforms the baseline and can effectively generalize to real-world images of unseen categories. The project page is at https://ov9d.github.io.

Paper Structure (33 sections, 6 equations, 14 figures, 9 tables)

Figures (14)

  • Figure 1: Open-vocabulary learning of category-level pose and size estimation. The model is trained on a large dataset with diverse categories so that it can generalize to novel categories, given a text prompt describing an unseen target object in a novel scene image.
  • Figure 2: Visualization of aligned objects: Row 1 features non-symmetric toy trucks with their heads aligned to the X-axis. Row 2 presents handbags containing discrete symmetric axes, with their openings aligned to the Y-axis. Row 3 shows bowls that possess continuous symmetric axes.
  • Figure 3: Example images from the OO3D-9D dataset. Single-object scenes, as in CO3D, are displayed in the first row, while challenging multi-object scenes are displayed in the second row.
  • Figure 4: Overall framework. Text features are acquired from the prompt through the CLIP model and fed to the SD UNet. By combining these text features with latent visual features from VQVAE, SD feature maps are generated. Simultaneously, the DinoV2 module processes the masked RGB image to obtain Dino features. Both features are then combined in the decoder to estimate the NOCS map of the target object. During the inference stage, the depth map is utilized to establish correspondence between NOCS and the camera frame. Finally, the object's size and pose are computed using a pose-fitting algorithm.
  • Figure 5: Qualitative results on OO3D-9D. Row 1: single object scene. Row 2: occluded scene. Predicted and ground truth boxes are colored in blue and green, respectively.
  • ...and 9 more figures
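
The inference stage described in the Figure 4 caption (depth map links NOCS predictions to the camera frame, then a pose-fitting step yields size and pose) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a pinhole camera and a least-squares similarity alignment (the Umeyama method), whereas the caption only says "a pose-fitting algorithm". The function names `backproject` and `fit_similarity` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical.

```python
import numpy as np


def backproject(depth, mask, fx, fy, cx, cy):
    """Lift masked depth pixels to 3D camera-frame points (pinhole model, assumed)."""
    v, u = np.nonzero(mask)          # pixel rows (v) and columns (u) inside the mask
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def fit_similarity(nocs_pts, cam_pts):
    """Umeyama alignment: find s, R, t minimizing ||cam - (s * R @ nocs + t)||^2.

    nocs_pts: (N, 3) predicted NOCS coordinates at the masked pixels.
    cam_pts:  (N, 3) back-projected camera-frame points at the same pixels.
    """
    mu_n = nocs_pts.mean(axis=0)
    mu_c = cam_pts.mean(axis=0)
    n_c = nocs_pts - mu_n
    c_c = cam_pts - mu_c

    # Cross-covariance between target (camera) and source (NOCS) points.
    cov = c_c.T @ n_c / len(nocs_pts)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0

    R = U @ np.diag(d) @ Vt
    var_n = (n_c ** 2).sum() / len(nocs_pts)   # variance of the source points
    s = (S * d).sum() / var_n                  # isotropic scale
    t = mu_c - s * R @ mu_n                    # translation
    return s, R, t
```

Here `R` and `t` give the orientation and position of the object in the camera frame, and `s` converts normalized NOCS extents into metric size. With noisy predictions, a robust wrapper such as RANSAC around `fit_similarity` would be a natural addition; whether the paper uses one is not stated in this section.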