Mocap Everyone Everywhere: Lightweight Motion Capture With Smartwatches and a Head-Mounted Camera

Jiye Lee, Hanbyul Joo

TL;DR

The paper tackles affordable, outside-the-lab 3D full-body motion capture by combining two wrist-worn IMUs with a head-mounted monocular camera. It introduces a two-stage Transformer-based estimator that leverages head poses from SLAM, a floor-level tracking module to handle non-flat environments, and a motion optimization stage that uses egocentric visual cues to refine the result within a learned motion manifold. Key contributions include the first full-body mocap from two IMUs plus head-mounted video, a floor-tracking method enabling outdoor and multi-level scenes, and a visual-cue-driven optimization that improves accuracy in object interactions and multi-person settings. The approach performs well in indoor and outdoor scenarios, outperforms several IMU-based baselines on root-related metrics, and offers practical potential for scalable, socially aware motion capture with minimal hardware.

Abstract

We present a lightweight and affordable motion capture method based on two smartwatches and a head-mounted camera. In contrast to the existing approaches that use six or more expert-level IMU devices, our approach is much more cost-effective and convenient. Our method can make wearable motion capture accessible to everyone everywhere, enabling 3D full-body motion capture in diverse environments. As a key idea to overcome the extreme sparsity and ambiguities of sensor inputs with different modalities, we integrate 6D head poses obtained from the head-mounted cameras for motion estimation. To enable capture in expansive indoor and outdoor scenes, we propose an algorithm to track and update floor level changes to define head poses, coupled with a multi-stage Transformer-based regression module. We also introduce novel strategies leveraging visual cues of egocentric images to further enhance the motion capture quality while reducing ambiguities. We demonstrate the performance of our method on various challenging scenarios, including complex outdoor environments and everyday motions including object interactions and social interactions among multiple individuals.

Paper Structure

This paper contains 27 sections, 9 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: We present a lightweight and affordable motion capture method based on two smartwatches and a head-mounted camera.
  • Figure 2: System Overview
  • Figure 3: (a) Visualization of input signals. (b) Visualization of the updated floor levels $f_t$.
  • Figure 4: Network architecture of submodules $\mathcal{F}^{end}$ (left) and $\mathcal{F}^{body}$ (right) of $\mathcal{F}_{est}$.
  • Figure 5: Comparison with $\mathcal{F}_{est}$ without floor update (left) and with floor update (right). The floor update algorithm corrects head height for estimating accurate poses.
  • ...and 6 more figures