HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction
Jikai Wang, Qifan Zhang, Yu-Wei Chao, Bowen Wen, Xiaohu Guo, Yu Xiang
TL;DR
HO-Cap introduces a markerless, scalable capture system for hand-object interaction that uses eight RGB-D cameras plus a HoloLens, paired with a semi-automatic annotation pipeline that yields 3D hand and object shapes and poses without domain-specific training. The pipeline combines BundleSDF-based object reconstruction, multi-view pose initialization with FoundationPose, SDF-based pose refinement, MANO hand modeling, and joint hand-object optimization to produce coherent 3D annotations. The HO-Cap dataset contains 64 videos, 656K frames, 9 subjects, and 64 textured objects, with ground-truth 3D shapes and poses and egocentric (first-person view) data, enabling benchmarks for hand pose estimation, object detection, and novel object pose estimation. The work reports baseline results on these benchmarks and discusses practical limitations, underscoring HO-Cap’s potential to advance embodied AI and robot manipulation research.
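To make the idea of SDF-based pose refinement concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy analytic sphere in place of a reconstructed object SDF, and it refines only the translation rather than a full 6-DoF pose. The function names (`sphere_sdf`, `refine_translation`) and all parameter values are hypothetical. The principle it illustrates is that observed surface points, mapped into the object frame by the correct pose, should have a signed distance of zero, so the pose can be refined by minimizing the squared SDF values.

```python
import numpy as np

def sphere_sdf(points, radius):
    """Signed distance from points (N, 3) to a sphere centred at the origin."""
    return np.linalg.norm(points, axis=1) - radius

def refine_translation(observed, radius, t_init, lr=0.5, steps=200):
    """Gradient descent on mean(sdf(observed - t)^2) over the translation t.

    Toy stand-in: a real pipeline would query a reconstructed object SDF
    and would also optimize rotation.
    """
    t = np.asarray(t_init, dtype=float).copy()
    for _ in range(steps):
        local = observed - t                      # points in the object frame
        dist = np.linalg.norm(local, axis=1)
        residual = dist - radius                  # SDF value per point
        # d(residual)/dt = -(local / dist), so by the chain rule:
        grad = np.mean(2.0 * residual[:, None] * (-local / dist[:, None]), axis=0)
        t -= lr * grad
    return t

# Synthetic "observed" surface points of a sphere at a known translation.
rng = np.random.default_rng(0)
directions = rng.normal(size=(500, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radius = 0.1
t_true = np.array([0.30, -0.10, 0.50])
observed = t_true + radius * directions

# Start from a perturbed estimate, as a coarse pose initializer might provide.
t_est = refine_translation(observed, radius, t_init=t_true + np.array([0.03, -0.02, 0.01]))
print("translation error:", np.linalg.norm(t_est - t_true))
```

In this sketch the perturbed starting translation plays the role of a coarse initial pose, and the refinement step pulls it back until the observed points lie on the zero level set of the SDF.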
Abstract
We introduce a data capture system and a new dataset, HO-Cap, for 3D reconstruction and pose tracking of hands and objects in videos. The system leverages multiple RGB-D cameras and a HoloLens headset for data collection, avoiding the use of expensive 3D scanners or mocap systems. We propose a semi-automatic method for annotating the shape and pose of hands and objects in the collected videos, significantly reducing the annotation time compared to manual labeling. With this system, we captured a video dataset of humans interacting with objects to perform various tasks, including simple pick-and-place actions, handovers between hands, and using objects according to their affordance, which can serve as human demonstrations for research in embodied AI and robot manipulation. Our data capture setup and annotation framework will be available for the community to use in reconstructing 3D shapes of objects and human hands and tracking their poses in videos.
