GAPartManip: A Large-scale Part-centric Dataset for Material-Agnostic Articulated Object Manipulation

Wenbo Cui, Chengyang Zhao, Songlin Wei, Jiazhao Zhang, Haoran Geng, Yaran Chen, Haoran Li, He Wang

TL;DR

GAPartManip targets robust articulated-object manipulation when object materials, such as transparent lids or reflective handles, corrupt depth sensing. It introduces a large-scale, part-centric synthetic dataset with realistic, physics-based IR rendering and dense, scene-level actionable pose annotations. The authors propose a modular framework that combines a diffusion-based depth reconstruction module, a part-aware pose prediction module (Part-aware EcoGrasp), and a local planner, enabling zero-shot sim-to-real transfer. The dataset comprises 918 object instances across 19 categories, 240K photo-realistic rendered images, and 8 billion scene-level actionable interaction poses, generated with domain randomization and GPU-accelerated annotation. Experiments show significant improvements in depth estimation and actionable pose prediction in both simulation and real-world settings, and the framework reports state-of-the-art performance for articulated-object manipulation. The work is intended to support generalizable manipulation in diverse home environments and will be released as open-source resources.

Abstract

Effectively manipulating articulated objects in household scenarios is a crucial step toward achieving general embodied artificial intelligence. Mainstream research in 3D vision has primarily focused on manipulation through depth perception and pose detection. However, in real-world environments, these methods often struggle with imperfect depth perception caused by, for example, transparent lids and reflective handles. Moreover, they generally lack the diversity in part-based interactions required for flexible and adaptable manipulation. To address these challenges, we introduce a large-scale part-centric dataset for articulated object manipulation that features both photo-realistic material randomization and detailed annotations of part-oriented, scene-level actionable interaction poses. We evaluate the effectiveness of our dataset by integrating it with several state-of-the-art methods for depth estimation and interaction pose prediction. Additionally, we propose a novel modular framework that delivers superior and robust performance for generalizable articulated object manipulation. Our extensive experiments demonstrate that our dataset significantly improves the performance of depth perception and actionable interaction pose prediction in both simulation and real-world scenarios. More information and demos can be found at: https://pku-epic.github.io/GAPartManip/.

Paper Structure

This paper contains 19 sections, 3 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: GAPartManip. We introduce a large-scale part-centric dataset for material-agnostic articulated object manipulation. It encompasses 19 common household articulated categories, totaling 918 object instances, 240K photo-realistic rendering images, and 8 billion scene-level actionable interaction poses. GAPartManip enables robust zero-shot sim-to-real transfer for accomplishing articulated object manipulation tasks.
  • Figure 2: Data Examples in GAPartManip. GAPartManip is a novel large-scale synthetic dataset for articulated objects, featuring two important aspects: 1) realistic, physics-based IR rendering for various object materials in diverse scenes, and 2) part-oriented actionable interaction pose annotations for a wide range of articulated objects. Each column shows a data sample. From top to bottom, each row displays the RGB image, the IR image (only the left IR image is shown here), and the scene-level actionable interaction pose annotations.
  • Figure 3: Dataset Generation Pipeline. For scene-level data sample rendering, we input the object asset into our photo-realistic rendering pipeline, generating one RGB image and two IR images (left and right) for each camera perspective. For pose annotation, we begin by performing mesh fusion for each GAPart of the object to establish a one-to-one correspondence between GAParts and meshes. Then, we use farthest point sampling (FPS) to obtain the point cloud for each GAPart, enabling part-level stable interaction pose annotation. These poses are further used in scene-level actionable interaction pose annotation for each rendered data sample.
  • Figure 4: Framework Overview. Given IR images and raw depth map, the depth reconstruction module first performs depth recovery. Subsequently, the pose prediction module generates a 7-DoF actionable pose and a 6-DoF post-grasping motion for interaction based on the reconstructed depth. Finally, the local planner module carries out the action planning and execution.
  • Figure 5: Qualitative Results for Depth Estimation in the Real World. Our refined depth maps are cleaner and more accurate than the ones from the baseline, indicating that our depth reconstruction module is more robust for transparent and translucent lids and small handles. Zoom in to better observe small parts like handles and knobs.
  • ...and 2 more figures
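
The Figure 3 caption mentions FPS for obtaining a per-part point cloud. FPS usually stands for farthest point sampling, which greedily picks points that are as far as possible from those already chosen, so the sample covers the part evenly. The following is a minimal sketch of that general technique, not the authors' implementation. The function name `farthest_point_sampling`, its parameters, and the assumption that candidate surface points have already been sampled from the fused part mesh are all hypothetical.

```python
import math


def farthest_point_sampling(points, k, start_index=0):
    """Return indices of k points chosen greedily to be mutually far apart.

    `points` is a sequence of equal-length coordinate tuples (assumed here to
    be candidate surface points of one part); `start_index` seeds the selection.
    """
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(points)")

    selected = [start_index]
    # min_dist[i] holds the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start_index]) for p in points]

    while len(selected) < k:
        # Greedy step: take the point farthest from the current selection.
        next_index = max(range(n), key=min_dist.__getitem__)
        selected.append(next_index)
        # Refresh nearest-selected distances against the newly added point.
        for i, p in enumerate(points):
            d = math.dist(p, points[next_index])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Hypothetical usage: four collinear candidate points, keep three of them.
candidates = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
sample_indices = farthest_point_sampling(candidates, 3)
```

This pure-Python version costs O(n·k) distance evaluations. For dense point clouds, a vectorized or GPU version of the same loop would be the natural choice.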