EgoHDM: An Online Egocentric-Inertial Human Motion Capture, Localization, and Dense Mapping System

Bonan Liu, Handi Yin, Manuel Kaufmann, Jinhao He, Sammy Christen, Jie Song, Pan Hui

TL;DR

EgoHDM is the first human mocap system that offers dense scene mapping in near real-time. It is fast and robust to initialize, and it fully closes the loop between physically plausible, map-aware global human motion estimation and mocap-aware 3D scene reconstruction.

Abstract

We present EgoHDM, an online egocentric-inertial human motion capture (mocap), localization, and dense mapping system. Our system uses six inertial measurement units (IMUs) and a commodity head-mounted RGB camera. EgoHDM is the first human mocap system that offers dense scene mapping in near real-time. Further, it is fast and robust to initialize and fully closes the loop between physically plausible, map-aware global human motion estimation and mocap-aware 3D scene reconstruction. Our key idea is to integrate camera localization and mapping information with inertial human motion capture bidirectionally. To achieve this, we design a tightly coupled mocap-aware dense bundle adjustment and a physics-based body pose correction module that leverages a local body-centric elevation map. The latter introduces a novel terrain-aware contact PD controller, which enables characters to physically contact the given local elevation map, thereby reducing human floating and penetration. We demonstrate the performance of our system on established synthetic and real-world benchmarks. The results show that our method reduces the errors in human localization, camera pose, and mapping by 41%, 71%, and 46%, respectively, compared to the state of the art. Our qualitative evaluations on newly captured data further demonstrate that EgoHDM can cover challenging scenarios on non-flat terrain, including stepping over stairs and outdoor scenes in the wild.
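To give a feel for what a contact PD controller does, the following is a minimal one-dimensional sketch, not the paper's controller. It assumes a point-mass foot that is already labeled as being in contact, and it drives the foot height toward the terrain height read from an elevation map. The function names, the gains `kp` and `kd`, the unit mass, and the time step are all hypothetical choices made for illustration.

```python
def pd_contact_force(foot_z, foot_vz, ground_z, kp=400.0, kd=40.0):
    """Vertical PD force pulling the foot toward the terrain height.

    kp and kd are illustrative gains (assumed, not from the paper);
    kd = 2 * sqrt(kp) gives critical damping for a unit mass.
    """
    return kp * (ground_z - foot_z) - kd * foot_vz


def settle_foot(foot_z, ground_z, mass=1.0, dt=0.01, steps=300):
    """Integrate the foot height under the PD force (semi-implicit Euler)."""
    vz = 0.0
    for _ in range(steps):
        acc = pd_contact_force(foot_z, vz, ground_z) / mass
        vz += acc * dt        # update velocity first
        foot_z += vz * dt     # then position, using the new velocity
    return foot_z


# A floating foot (15 cm above flat ground) and a penetrating foot
# (5 cm below a 10 cm step) both settle on the terrain surface.
floating = settle_foot(foot_z=0.15, ground_z=0.0)
penetrating = settle_foot(foot_z=-0.05, ground_z=0.10)
```

The same error term handles both failure modes named in the abstract: a positive height error (penetration) pushes the foot up, and a negative one (floating) pulls it down.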

Paper Structure (25 sections, 6 equations, 6 figures, 5 tables)

This paper contains 25 sections, 6 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of EgoHDM. The inputs to EgoHDM are real-time acceleration and orientation measurements from six body-worn IMUs and monocular egocentric RGB images. We first initialize the system (VIM Initialization, Sec. \ref{init}) by finding a similarity transform $\mathbf{T}_{hc}$ that aligns the inertial and camera frames, with an accurate scale recovered from body shape constraints. After initialization, the mocap-aware dense bundle adjustment (MDBA, Sec. \ref{DBAH}) jointly optimizes the camera poses and depth images of keyframes by integrating inertial human motion constraints with RGB-based SLAM \cite{teed2021droid}. We then construct and maintain a consistent, dense 3D map with global BA and loop closing (Sec. \ref{mapping}). To reduce the influence of depth noise on our global map, we employ covariance-guided volumetric fusion \cite{sig}. Next, we create a local body-centric elevation map with a fixed resolution by projecting the global map along the direction of gravity (Sec. \ref{localmap}). Lastly, in the map-aware inertial mocap module (Sec. \ref{mocap}), we refine the poses provided by a learning-based inertial pose estimator \cite{yi2022physical} by introducing a physics-based correction module that leverages the elevation map to establish foot-to-ground contact. The corrected poses are fed back to the MDBA, thereby fully closing the loop between inertial pose estimation and SLAM-based mapping.
  • Figure 2: Qualitative comparison with EgoLocate on the HPS dataset. EgoLocate's estimates can penetrate the floor or float unrealistically, whereas our method estimates more accurate floor contacts, even in the challenging case of the human lying on the floor.
  • Figure 3: Qualitative comparison with PIP (inertial-only) and EgoLocate (inertial + sparse SLAM) on synthetic TotalCapture. The dense map shown in the figure is reconstructed online by our system. The blue square represents the elevation map. Our results follow the ground truth more closely than either baseline.
  • Figure 4: Qualitative comparison of mapping accuracy with offline Droid-SLAM and EgoLocate on synthetic TotalCapture. For Droid-SLAM, we align the scale with the ground-truth trajectory from the first 8 keyframes. Blue indicates low error and red indicates high error (> 1 meter). Note that even for the challenging "Flooded Grounds" scene, our method provides robust mapping of the terrain.
  • Figure 5: Ablation study in terms of mapping accuracy on our newly captured scenes with terrain height changes. Errors above $1.0$ m are clipped and excess geometry discarded. The point-to-point error distribution, drawn next to the color bar, reveals that our full system's error is primarily centered around a low near-zero mean. The absence of foot-ground constraints in the VIM initialization (2nd column) and the lack of mocap constraints in the MDBA module (3rd column) lead to increased mapping bias and scale uncertainty, thus driving up the average error and its variance.
  • ...and 1 more figure
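The Figure 1 caption describes building a local body-centric elevation map by projecting the global map along the direction of gravity. The sketch below illustrates that idea on a plain point cloud, under several assumptions that are not from the paper: the points are already in a gravity-aligned frame with z pointing up, the map is a square grid centered on the body, and each cell keeps the highest point that falls into it. The function name and the `half_extent` and `resolution` parameters are hypothetical.

```python
import numpy as np


def local_elevation_map(points, center_xy, half_extent=1.0, resolution=0.1):
    """Rasterize a point cloud into a body-centric height grid.

    points:      (N, 3) array in a gravity-aligned frame (z points up).
    center_xy:   (2,) horizontal position of the body.
    half_extent: half the side length of the square map, in meters.
    resolution:  cell size in meters.
    Returns an (n, n) array of heights; cells with no points are NaN.
    """
    points = np.asarray(points, dtype=float)
    n = int(round(2 * half_extent / resolution))

    # Horizontal offsets from the body, converted to integer cell indices.
    rel = points[:, :2] - np.asarray(center_xy, dtype=float)
    idx = np.floor((rel + half_extent) / resolution).astype(int)

    # Discard points that fall outside the local window.
    inside = np.all((idx >= 0) & (idx < n), axis=1)
    idx, z = idx[inside], points[inside, 2]

    # Project along gravity: keep the highest surface seen in each cell.
    heights = np.full((n, n), -np.inf)
    np.maximum.at(heights, (idx[:, 0], idx[:, 1]), z)

    grid = np.full((n, n), np.nan)
    hit = np.isfinite(heights)
    grid[hit] = heights[hit]
    return grid


# Two points share one cell (the higher one wins); the third lies outside.
cloud = [[0.05, 0.05, 0.2], [0.05, 0.05, 0.5], [5.0, 5.0, 1.0]]
grid = local_elevation_map(cloud, center_xy=(0.0, 0.0))
```

Taking the per-cell maximum is one simple choice; a real system working from a fused volumetric map would more likely query the reconstructed surface directly and handle overhangs and noise explicitly.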