HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction

Jikai Wang, Qifan Zhang, Yu-Wei Chao, Bowen Wen, Xiaohu Guo, Yu Xiang

TL;DR

HO-Cap introduces a markerless, scalable capture system for hand-object interaction using eight RGB-D cameras plus a HoloLens headset, paired with a semi-automatic annotation pipeline that yields 3D hand and object shapes and poses without domain-specific training. The pipeline combines BundleSDF-based object reconstruction, multi-view pose initialization with FoundationPose, SDF-based pose refinement, MANO hand modeling, and joint hand-object optimization to produce coherent 3D annotations. The HO-Cap dataset contains 64 videos, 656K frames, 9 subjects, and 64 textured objects, with ground-truth 3D shapes and poses plus egocentric (first-person view) data, enabling benchmarks for hand pose estimation, object detection, and novel object pose estimation. The work reports baseline results on these benchmarks, discusses practical limitations, and positions HO-Cap as a resource for embodied AI and robot manipulation research.

Abstract

We introduce a data capture system and a new dataset, HO-Cap, for 3D reconstruction and pose tracking of hands and objects in videos. The system leverages multiple RGBD cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems. We propose a semi-automatic method for annotating the shape and pose of hands and objects in the collected videos, significantly reducing the annotation time compared to manual labeling. With this system, we captured a video dataset of humans interacting with objects to perform various tasks, including simple pick-and-place actions, handovers between hands, and using objects according to their affordance, which can serve as human demonstrations for research in embodied AI and robot manipulation. Our data capture setup and annotation framework will be available for the community to use in reconstructing 3D shapes of objects and human hands and tracking their poses in videos.

Paper Structure

This paper contains 31 sections, 7 equations, 12 figures, and 8 tables.

Figures (12)

  • Figure 1: Examples of RGB frames and of the 3D shape and pose annotations of hands and objects in our dataset, rendered onto images and in the NVIDIA Isaac Sim simulator.
  • Figure 2: Illustration of our data capture setup.
  • Figure 3: Illustration of our pipeline for 3D object reconstruction.
  • Figure 4: Illustration of our pipeline for obtaining poses of hands and objects from multi-view RGB-D videos.
  • Figure 5: Comparison between the published and refined HoloLens poses.
  • ...and 7 more figures