Mocap Everyone Everywhere: Lightweight Motion Capture With Smartwatches and a Head-Mounted Camera
Jiye Lee, Hanbyul Joo
TL;DR
The paper tackles affordable, outside-the-lab 3D full-body motion capture by combining two wrist-worn IMUs (smartwatches) with a head-mounted monocular camera. It introduces a two-stage Transformer-based estimator that uses head poses from SLAM, a floor-level tracking module that handles non-flat environments, and a motion optimization stage that uses egocentric visual cues to refine the result within a learned motion manifold. Key contributions include the first full-body mocap from two IMUs plus head-mounted video, a floor-tracking method that enables capture in outdoor and multi-level scenes, and a visual-cue-driven optimization that improves accuracy in object interactions and multi-person settings. The approach performs well in both indoor and outdoor scenarios, outperforms several IMU-based baselines on root-related metrics, and points toward scalable motion capture of everyday and social activities with minimal hardware.
Abstract
We present a lightweight and affordable motion capture method based on two smartwatches and a head-mounted camera. In contrast to the existing approaches that use six or more expert-level IMU devices, our approach is much more cost-effective and convenient. Our method can make wearable motion capture accessible to everyone everywhere, enabling 3D full-body motion capture in diverse environments. As a key idea to overcome the extreme sparsity and ambiguities of sensor inputs with different modalities, we integrate 6D head poses obtained from the head-mounted cameras for motion estimation. To enable capture in expansive indoor and outdoor scenes, we propose an algorithm to track and update floor level changes to define head poses, coupled with a multi-stage Transformer-based regression module. We also introduce novel strategies leveraging visual cues of egocentric images to further enhance the motion capture quality while reducing ambiguities. We demonstrate the performance of our method on various challenging scenarios, including complex outdoor environments and everyday motions including object interactions and social interactions among multiple individuals.
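As a rough illustration of why tracking the floor level matters when defining head poses, the sketch below expresses the head height relative to a current floor estimate and updates that estimate when the wearer moves to a different level. This is not the paper's algorithm: the function names, the foot-contact observation, and the 0.1 m threshold are all assumptions made for illustration.

```python
def floor_relative_height(head_height_world: float, floor_level: float) -> float:
    """Head height measured from the current floor estimate (metres)."""
    return head_height_world - floor_level


def update_floor_level(floor_level: float,
                       contact_foot_height: float,
                       threshold: float = 0.1) -> float:
    """Hypothetical update rule: if a foot in contact with the ground sits
    more than `threshold` metres away from the current floor estimate,
    treat that height as the new floor level; otherwise keep the estimate."""
    if abs(contact_foot_height - floor_level) > threshold:
        return contact_foot_height
    return floor_level


# Example: the wearer stands on flat ground, then steps onto a 0.5 m platform.
floor = 0.0
floor = update_floor_level(floor, contact_foot_height=0.02)  # small noise, floor kept
floor = update_floor_level(floor, contact_foot_height=0.5)   # level change, floor updated
head_rel = floor_relative_height(2.0, floor)                 # head height above the new floor
```

Without such an update, the head height above the original ground would keep growing on stairs or slopes, and a regressor trained on flat-ground data could misread it, for example as a jump.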
