SHOW3D: Capturing Scenes of 3D Hands and Objects in the Wild

Patrick Rim, Kevin Harris, Braden Copple, Shangchen Han, Xu Xie, Ivan Shugurov, Sizhe An, He Wen, Alex Wong, Tomas Hodan, Kun He

Abstract

Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both environmental diversity and the ability of models trained on such data to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows nearly unconstrained mobility in genuinely in-the-wild conditions while still supporting precise 3D annotation of hands and objects. The capture system consists of a lightweight, back-mounted, multi-camera rig that is synchronized and calibrated with a user-worn VR headset. For 3D ground-truth annotation of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations of hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach substantially reduces the fundamental trade-off between environmental realism and 3D annotation accuracy, which we validate with experiments on several downstream tasks. Project page: show3d-dataset.github.io

Paper Structure

This paper contains 21 sections, 2 equations, 16 figures, and 7 tables.

Figures (16)

  • Figure 1: SHOW3D is the first dataset of in-the-wild hand–object interactions with accurate 3D annotations as well as text descriptions. The dataset was captured with our novel mobile multi-camera rig in diverse indoor and outdoor scenes, and annotated with 3D shapes and poses with our multi-view pipeline. Overlays show 3D annotations projected onto egocentric images (hands in red and blue, object in green).
  • Figure 2: Our mobile multi-camera capture rig. Left: Annotated hardware layout showing five MoCap cameras (red), eight exocentric monochrome cameras mounted in a half-dome configuration (green), and two egocentric monochrome cameras on the Meta Quest 3 headset (blue). The MoCap cameras are used only for headset pose tracking, via optical markers rigidly attached to the headset; a sketch of the resulting transform chain follows this list. The ten exocentric and egocentric monochrome cameras are used for marker-free annotation of 3D hand and object poses. Right: The rig in use during in-the-wild capture sessions, demonstrating its lightweight (about eight kilograms), wearable design that allows natural interaction while maintaining stable, synchronized multi-view coverage under mostly unconstrained motion.
  • Figure 3: Our ego-exo pipeline for 3D hand and object pose annotation (a projection sketch follows this list). (a) Multi-view fisheye images from our ego and exo cameras. (b) We detect 3D hand keypoints by fusing predictions from Sapiens [khirodkar2024sapiens] and InterNet [Moon_2020_ECCV_InterHand2.6M], and fit a personalized hand mesh via inverse kinematics. (c) CAD-based 3D object pose estimation using CNOS [nguyen2023cnos], FoundPose [ornek2024foundpose], and GoTrack [nguyen2025gotrack]. (d) The resulting 3D ground-truth annotations are projected back into the ego cameras and can be used to train egocentric vision models.
  • Figure 4: Cross-dataset feature embedding. We plot UMAP [mcinnes2018umap-software] embeddings of DINOv2 [oquab2023dinov2] features extracted from raw images across different hand–object interaction datasets (a minimal embedding sketch follows this list). SHOW3D (pink) spans diverse visual domains between datasets collected in controlled environments: GigaHands (blue), HOT3D (green), and ARCTIC (yellow). Best viewed in color.
  • Figure 5: Hand pose estimation on the SHOW3D test set. From left to right: results from models trained on UmeTrack + HOT3D + SHOW3D, HOT3D, and UmeTrack, respectively. Including training data from SHOW3D significantly improves robustness against object occlusion and background clutter.
  • ...and 11 more figures
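
To make the headset-based calibration of Figure 2 concrete, here is a minimal sketch of the transform chain: the MoCap system localizes the headset in the world frame, and a one-time rig calibration fixes each camera's pose relative to the headset. The function and variable names, the 4x4 homogeneous-matrix convention, and the numeric values are illustrative assumptions, not the paper's implementation.

```python
# Sketch: localize an exo camera rigidly mounted to the rig in the world frame.
# MoCap gives the headset pose; a one-time calibration gives each camera's
# pose relative to the headset. All names/values here are assumptions.
import numpy as np

def pose_to_mat(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Pack rotation R (3x3) and translation t (3,) into a 4x4 rigid transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# T_world_headset: headset pose in the world frame, from MoCap marker tracking.
T_world_headset = pose_to_mat(np.eye(3), np.array([0.0, 1.6, 0.0]))  # placeholder
# T_headset_cam: fixed camera-to-headset offset from one-time rig calibration.
T_headset_cam = pose_to_mat(np.eye(3), np.array([0.1, 0.2, -0.3]))   # placeholder

# Camera pose in the world frame is the composition of the two transforms.
T_world_cam = T_world_headset @ T_headset_cam
# For projecting annotations we need the inverse: world -> camera coordinates.
T_cam_world = np.linalg.inv(T_world_cam)
print(T_cam_world.round(3))
```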
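Step (d) of the Figure 3 pipeline projects the 3D ground-truth annotations into the egocentric fisheye cameras. A minimal sketch, assuming an OpenCV-style fisheye model (`cv2.fisheye.projectPoints`); the paper's actual camera model and calibration format are not specified here, and the calibration values below are placeholders.

```python
# Sketch: project 3D hand keypoints into an egocentric fisheye camera, as in
# the Figure 3 overlays. Camera model and values are assumptions.
import cv2
import numpy as np

def project_to_ego(points_world: np.ndarray,
                   R_world_to_cam: np.ndarray,
                   t_world_to_cam: np.ndarray,
                   K: np.ndarray,
                   D: np.ndarray) -> np.ndarray:
    """Project Nx3 world-space keypoints to Nx2 fisheye pixel coordinates."""
    rvec, _ = cv2.Rodrigues(R_world_to_cam)  # rotation matrix -> axis-angle
    pts = points_world.reshape(-1, 1, 3).astype(np.float64)
    pixels, _ = cv2.fisheye.projectPoints(
        pts, rvec, t_world_to_cam.reshape(3, 1), K, D)
    return pixels.reshape(-1, 2)

# Placeholder calibration (hypothetical intrinsics and distortion).
K = np.array([[300.0, 0.0, 320.0],
              [0.0, 300.0, 240.0],
              [0.0, 0.0, 1.0]])
D = np.zeros(4)   # fisheye distortion coefficients k1..k4
R = np.eye(3)
t = np.zeros(3)
# 21 hand joints placed in front of the camera (synthetic example).
hand_keypoints = np.random.rand(21, 3) + np.array([0.0, 0.0, 0.5])
uv = project_to_ego(hand_keypoints, R, t, K, D)
print(uv.shape)   # (21, 2)
```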
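The cross-dataset embedding of Figure 4 can be reproduced in spirit with a short script: embed sampled frames with DINOv2 and reduce the features to 2D with UMAP. A minimal sketch, assuming the public `facebookresearch/dinov2` torch.hub weights and the `umap-learn` package; the model variant, preprocessing, and frame sampling are assumptions rather than the paper's exact protocol.

```python
# Sketch of the Figure 4 procedure: DINOv2 image features -> 2D UMAP embedding.
import glob
import numpy as np
import torch
import umap  # pip install umap-learn
from PIL import Image
from torchvision import transforms

# Load a small DINOv2 backbone; its forward() returns the global (CLS) feature.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),  # divisible by the ViT's 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    """One 384-dim DINOv2 feature per image."""
    feats = [model(preprocess(Image.open(p).convert("RGB")).unsqueeze(0))
             .squeeze(0).numpy() for p in paths]
    return np.stack(feats)

# Hypothetical frame samples, one subdirectory per dataset.
paths = sorted(glob.glob("sampled_frames/*/*.jpg"))
xy = umap.UMAP(n_components=2, random_state=0).fit_transform(embed(paths))
# 'xy' is an (N, 2) array, ready to scatter-plot colored by source dataset.
```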