HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos

Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon, Shreyas Hampali, Shangchen Han, Fan Zhang, Linguang Zhang, Jade Fountain, Edward Miller, Selen Basol, Richard Newcombe, Robert Wang, Jakob Julian Engel, Tomas Hodan

TL;DR

HOT3D tackles the challenge of robust 3D hand-object tracking from egocentric multi-view video. It provides a large-scale, richly annotated dataset combining Aria and Quest 3 streams with mocap ground truth for hands and objects, plus 3D object models and eye gaze. The authors demonstrate that multi-view approaches markedly outperform single-view baselines across 3D hand tracking, 6DoF object pose estimation, and 3D lifting of unknown in-hand objects, and they establish strong baselines using FoundPose extensions and stereo matching. The dataset and accompanying benchmarks, tutorials, and onboarding sequences are designed to spur progress in AR/VR, robotics, and AI assistants.

Abstract

We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (3.7M+ images) of recordings that feature 19 subjects interacting with 33 diverse rigid objects. In addition to simple pick-up, observe, and put-down actions, the subjects perform actions typical for a kitchen, office, and living room environment. The recordings include multiple synchronized data streams containing egocentric multi-view RGB/monochrome images, eye gaze signal, scene point clouds, and 3D poses of cameras, hands, and objects. The dataset is recorded with two headsets from Meta: Project Aria, which is a research prototype of AI glasses, and Quest 3, a virtual-reality headset that has shipped millions of units. Ground-truth poses were obtained by a motion-capture system using small optical markers attached to hands and objects. Hand annotations are provided in the UmeTrack and MANO formats, and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner. In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, model-based 6DoF object pose estimation, and 3D lifting of unknown in-hand objects. The evaluated multi-view methods, whose benchmarking is uniquely enabled by HOT3D, significantly outperform their single-view counterparts.
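
To make the term "6DoF object pose" concrete: such a pose is a rigid transform, a 3D rotation plus a 3D translation, that maps points from the object model's coordinate frame into a camera or world frame. The following is a minimal sketch in plain Python. The function names and example values are illustrative assumptions and do not reflect the HOT3D toolkit's API or data format.

```python
import math

def rotation_z(angle_rad):
    """Return a 3x3 rotation matrix for a rotation about the z-axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def apply_pose(rotation, translation, point):
    """Map a 3D point from model to target coordinates: R @ p + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )

# Hypothetical pose: rotate 90 degrees about z, then shift 0.5 m along x.
R = rotation_z(math.pi / 2)
t = (0.5, 0.0, 0.0)

# Hypothetical model vertex, 0.1 m along the model's x-axis.
vertex = (0.1, 0.0, 0.0)
transformed = apply_pose(R, t, vertex)
```

Applying the same transform to every vertex of an object mesh places the model in the scene, which is how ground-truth contours like those in the figures can be rendered over the images.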

Paper Structure

This paper contains 18 sections, 14 figures, 7 tables.

Figures (14)

  • Figure 1: HOT3D overview. The dataset includes multi-view egocentric image streams from Aria [engel2023project] and Quest 3 [Quest3] annotated with high-quality ground-truth 3D poses and models of hands and objects. Three multi-view frames from Aria are shown on the left, with contours of 3D models of hands and objects in the ground-truth poses in white and green, respectively. Aria also provides 3D point clouds from SLAM and eye gaze information (right).
  • Figure 2: Sample images from Aria (top) and Quest 3 (bottom). Aria recordings include one RGB and two monochrome image streams, while Quest 3 recordings include two monochrome streams (only images from one of the multi-view streams are shown). Contours of 3D models of hands and objects in the ground-truth poses are shown in white and green, respectively. In addition to simple pick-up/observe/put-down actions, the subjects perform actions that are common in a kitchen, office, and living room. To increase diversity, the lighting, furniture, and decorations in the capture lab were regularly randomized.
  • Figure 3: High-quality 3D mesh models. This image shows a rendering of the 33 object models, demonstrating their quality. The models were obtained by an in-house scanning-based 3D object reconstruction pipeline and include PBR materials, which enable rendering of photo-realistic training images for methods that require it. The collection includes household and office objects of diverse appearance, size, and affordances.
  • Figure 4: Distances traveled by HOT3D objects. In total, subjects moved the 33 objects over 13 km. While objects like the keyboard and waffles were mostly resting, the white mug is a true explorer.
  • Figure 5: Motion-capture lab. The HOT3D dataset was collected using a motion-capture rig equipped with a few dozen infrared exocentric OptiTrack cameras and light diffuser panels for illumination variability.
  • ...and 9 more figures
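
The per-object distances in Figure 4 can be read as path lengths: the sum of Euclidean distances between consecutive 3D positions of an object. Below is a minimal sketch under that assumption. The source does not state how the authors computed the statistic, and the function name and sample trajectory are hypothetical.

```python
import math

def path_length(positions):
    """Sum the Euclidean distances between consecutive 3D positions."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))

# Hypothetical object trajectory in meters, one position per frame.
trajectory = [
    (0.0, 0.0, 0.0),
    (0.3, 0.0, 0.0),  # moved 0.3 m along x
    (0.3, 0.4, 0.0),  # moved 0.4 m along y
    (0.3, 0.4, 0.0),  # resting, contributes no distance
]
total = path_length(trajectory)
```

On real tracking data, per-frame jitter in the estimated positions would inflate such a sum, so smoothing or a minimum-displacement threshold may be needed before accumulating.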