Ego-Exo 3D Hand Tracking in the Wild with a Mobile Multi-Camera Rig

Patrick Rim, Kun He, Kevin Harris, Braden Copple, Shangchen Han, Sizhe An, Ivan Shugurov, Tomas Hodan, He Wen, Xu Xie

TL;DR

The paper tackles robust 3D hand tracking in unconstrained environments using a mobile multi-camera rig. The wearable rig combines eight exocentric fisheye cameras with two egocentric Quest 3 cameras, and a marker-less multi-view ego-exo pipeline generates precise 3D hand poses. The authors validate ground-truth quality against a high-coverage dome and introduce EgoExo-Hands, a dataset of about 30k annotated frames that narrows the trade-off between environmental realism and 3D annotation accuracy. Cross-dataset experiments reveal generalization gaps, underscoring the dataset's difficulty and its value as a benchmark for robust hand pose estimation in the wild.

Abstract

Accurate 3D tracking of hands and their interactions with the world in unconstrained settings remains a significant challenge for egocentric computer vision. With few exceptions, existing datasets are predominantly captured in controlled lab setups, limiting environmental diversity and model generalization. To address this, we introduce a novel marker-less multi-camera system designed to capture precise 3D hands and objects, which allows for nearly unconstrained mobility in genuinely in-the-wild conditions. We combine a lightweight, back-mounted capture rig with eight exocentric cameras, and a user-worn Meta Quest 3 headset, which contributes two egocentric views. We design an ego-exo tracking pipeline to generate accurate 3D hand pose ground truth from this system, and rigorously evaluate its quality. By collecting an annotated dataset featuring synchronized multi-view images and precise 3D hand poses, we demonstrate the capability of our approach to significantly reduce the trade-off between environmental realism and 3D annotation accuracy.

Paper Structure

This paper contains 7 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Our mobile multi-camera capture rig. Left: annotated hardware layout showing five OptiTrack motion capture cameras (red), two egocentric headset cameras (blue), and eight exocentric fisheye cameras mounted in a half-dome configuration (green) including top-down, far-corner, protruding, and bottom-up placements. Right: the rig in use during an in-the-wild capture session, demonstrating its lightweight, wearable design that allows natural interaction while maintaining synchronized multi-view coverage.
  • Figure 2: Overview of our multi-stage pipeline for accurate 3D hand pose annotation. (a) Multi-view fisheye images from our 10 ego/exo cameras. (b) Sapiens body keypoint detector predicts 2D hand keypoints per frame and localizes hands for cropping. (c) InterNet hand-specific detector on perspective crops for additional 2D hand keypoints. (d) Left/right 3D hand keypoints are triangulated across all views using robust multi-view geometry. (e) Personalized meshes are fitted to triangulated 3D keypoints and projected back to 2D views.
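
Step (d) of Figure 2 triangulates 3D hand keypoints from 2D detections across views. As a rough illustration only, the sketch below applies plain linear (DLT) triangulation with NumPy, assuming ideal pinhole projection matrices. The rig uses fisheye cameras and a robust multi-view formulation whose details are not given in this summary, so the helper names `triangulate_dlt` and `project`, and the toy cameras, are hypothetical and not the authors' implementation.

```python
import numpy as np


def project(P, X):
    """Project a 3D point X (3,) to pixels with a 3x4 pinhole matrix P."""
    x = P @ np.append(X, 1.0)   # homogeneous image point
    return x[:2] / x[2]         # perspective divide


def triangulate_dlt(projections, points_2d):
    """Linear (DLT) triangulation of one keypoint seen in several views.

    projections: sequence of 3x4 projection matrices (assumed pinhole;
                 fisheye images would need undistortion first).
    points_2d:   sequence of (u, v) observations, one per view.
    Returns the 3D point that minimizes the algebraic error.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X_h = vt[-1]
    return X_h[:3] / X_h[3]


if __name__ == "__main__":
    # Toy setup (assumption): three translated cameras sharing one intrinsic matrix.
    K = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])
    offsets = [np.zeros(3), np.array([-0.5, 0.0, 0.0]), np.array([0.0, -0.5, 0.0])]
    cameras = [K @ np.hstack([np.eye(3), t.reshape(3, 1)]) for t in offsets]

    keypoint = np.array([0.1, -0.2, 1.5])
    observations = [project(P, keypoint) for P in cameras]
    print(triangulate_dlt(cameras, observations))
```

A robust variant would typically down-weight or reject views with large reprojection error (for example via RANSAC over view subsets) before the final solve; that step is omitted here for brevity.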