Ego-Exo 3D Hand Tracking in the Wild with a Mobile Multi-Camera Rig
Patrick Rim, Kun He, Kevin Harris, Braden Copple, Shangchen Han, Sizhe An, Ivan Shugurov, Tomas Hodan, He Wen, Xu Xie
TL;DR
The paper tackles robust 3D hand tracking in unconstrained environments using a mobile multi-camera rig. It presents a wearable rig with eight exocentric fisheye cameras and two egocentric views from a Meta Quest 3 headset, plus a marker-less multi-view ego-exo pipeline that generates precise 3D hand poses. It validates ground-truth quality against a high-coverage dome and introduces EgoExo-Hands, a dataset of about 30k annotated frames that narrows the trade-off between environmental realism and 3D annotation accuracy. Cross-dataset experiments reveal generalization gaps, underscoring the dataset's difficulty and its value as a benchmark for robust in-the-wild hand pose estimation.
Abstract
Accurate 3D tracking of hands and their interactions with the world in unconstrained settings remains a significant challenge for egocentric computer vision. With few exceptions, existing datasets are predominantly captured in controlled lab setups, limiting environmental diversity and model generalization. To address this, we introduce a novel marker-less multi-camera system designed to capture precise 3D hands and objects, which allows for nearly unconstrained mobility in genuinely in-the-wild conditions. We combine a lightweight, back-mounted capture rig carrying eight exocentric cameras with a user-worn Meta Quest 3 headset, which contributes two egocentric views. We design an ego-exo tracking pipeline to generate accurate 3D hand pose ground truth from this system, and rigorously evaluate its quality. By collecting an annotated dataset featuring synchronized multi-view images and precise 3D hand poses, we demonstrate the capability of our approach to significantly reduce the trade-off between environmental realism and 3D annotation accuracy.
