EMHI: A Multimodal Egocentric Human Motion Dataset with HMD and Body-Worn IMUs

Zhen Fan, Peng Dai, Zhuo Su, Xu Gao, Zheng Lv, Jiarui Zhang, Tianyuan Du, Guidong Wang, Yang Zhang

TL;DR

This work tackles the challenge of accurate egocentric human pose estimation by marrying real-world stereo egocentric vision with body-worn IMUs. It introduces EMHI, a large-scale multimodal dataset captured on actual VR hardware, providing synchronized imagery, IMU signals, and SMPL ground-truth across diverse subjects, actions, and lighting. It also proposes MEPoser, a baseline that fuses multimodal inputs with a temporal encoder and SMPL decoder to achieve real-time, improved pose estimation on an HMD, outperforming single-modal approaches. Together, EMHI and MEPoser enable more robust, deployable egocentric HPE for VR/AR applications and facilitate future multimodal research on wearable sensors.

Abstract

Egocentric human pose estimation (HPE) using wearable sensors is essential for VR/AR applications. Most methods rely solely on either egocentric-view images or sparse Inertial Measurement Unit (IMU) signals, leading to inaccuracies due to self-occlusion in images or the sparseness and drift of inertial sensors. Most importantly, the lack of real-world datasets containing both modalities is a major obstacle to progress in this field. To overcome this barrier, we propose EMHI, a multimodal Egocentric human Motion dataset with Head-Mounted Display (HMD) and body-worn IMUs, with all data collected using a real VR product suite. Specifically, EMHI provides synchronized stereo images from downward-sloping cameras on the headset and IMU data from body-worn sensors, along with pose annotations in SMPL format. The dataset consists of 885 sequences captured from 58 subjects performing 39 actions, totaling about 28.5 hours of recording. We evaluate the annotations by comparing them with optical marker-based SMPL fitting results. To substantiate the reliability of our dataset, we introduce MEPoser, a new baseline method for multimodal egocentric HPE, which employs a multimodal fusion encoder, a temporal feature encoder, and MLP-based regression heads. Experiments on EMHI show that MEPoser outperforms existing single-modal methods and demonstrate the value of our dataset for solving the problem of egocentric HPE. We believe the release of EMHI and the method could advance research on egocentric HPE and expedite the practical implementation of this technology in VR/AR products.
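
To give a sense of scale, the dataset figures quoted in the abstract can be turned into rough per-sequence and per-subject averages. The short Python sketch below is only back-of-the-envelope arithmetic on those three numbers; the function names are illustrative, and because the abstract says "about 28.5 hours", the results are approximate and say nothing about how recording time is actually distributed across subjects or actions.

```python
# Figures quoted in the abstract.
NUM_SEQUENCES = 885
NUM_SUBJECTS = 58
TOTAL_HOURS = 28.5  # stated as "about 28.5 hours", so all results are approximate


def mean_sequence_seconds(total_hours: float, num_sequences: int) -> float:
    """Average sequence duration in seconds, assuming nothing about the spread."""
    return total_hours * 3600.0 / num_sequences


def mean_minutes_per_subject(total_hours: float, num_subjects: int) -> float:
    """Average recording time per subject in minutes."""
    return total_hours * 60.0 / num_subjects


avg_seq_s = mean_sequence_seconds(TOTAL_HOURS, NUM_SEQUENCES)
avg_subj_min = mean_minutes_per_subject(TOTAL_HOURS, NUM_SUBJECTS)
avg_seqs_per_subject = NUM_SEQUENCES / NUM_SUBJECTS

print(f"{avg_seq_s:.0f} s per sequence, {avg_subj_min:.1f} min per subject")
# prints "116 s per sequence, 29.5 min per subject"
```

On these averages, a typical sequence lasts just under two minutes and each subject contributes roughly half an hour of recording across about 15 sequences.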

Paper Structure (22 sections, 3 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: EMHI is a multimodal dataset that provides (a) stereo egocentric images and (b) IMU signals. The annotations include (c) 2D keypoints overlaid on egocentric images and (d) SMPL parameters in the world coordinate system. Each sequence is also annotated with (e) the action label, as well as (f) individual attributes such as height, BMI, and clothing descriptions.
  • Figure 2: Hardware setup and ground-truth acquisition pipeline. (a) The data capture system consists of the EgoSensorKit for collecting egocentric images and calibrated IMU signals, eight Azure Kinects for recording multiple third-view images, and an OptiTrack system for spatiotemporal synchronization of the above signals. With the data collected in (b), (c) we automatically produce the annotations, including SMPL parameters and 2D keypoints on egocentric images.
  • Figure 3: MEPoser. The proposed method consists of a multimodal fusion encoder for feature extraction and fusion of input signals, a temporal feature encoder for associating history information, and an SMPL decoder for predicting SMPL parameters.
  • Figure 4: Qualitative comparison between our method and single-modal methods (GT: green, estimation: red). Our method alleviates joint invisibility in egocentric images (purple), IMU data drift (yellow), and ambiguous measurements in slow motions (gray).
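
The contents listed in the Figure 1 caption can be pictured as a per-frame record plus per-sequence metadata. The Python sketch below is one hypothetical way to hold such a sample in memory; it is not the dataset's released file format. Every class name, field name, and unit is an assumption made for illustration, and the length check assumes the standard SMPL parameterization (24 joints × 3 axis-angle values for pose, 10 shape coefficients), which the source does not confirm for EMHI.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Standard SMPL sizes; whether EMHI stores parameters this way is an assumption.
SMPL_POSE_DIM = 24 * 3   # axis-angle rotation for each of 24 joints
SMPL_SHAPE_DIM = 10      # shape coefficients (betas)


@dataclass
class SubjectAttributes:
    """(f) Individual attributes; field names and units are hypothetical."""
    height_cm: float
    bmi: float
    clothing: str


@dataclass
class Frame:
    """One synchronized time step; all field names are hypothetical."""
    stereo_image_paths: Tuple[str, str]                  # (a) left and right views
    imu_signals: Dict[str, List[float]]                  # (b) keyed by sensor name
    keypoints_2d: Dict[str, List[Tuple[float, float]]]   # (c) keyed by camera
    smpl_pose: List[float]                               # (d) world coordinates
    smpl_shape: List[float]
    smpl_translation: Tuple[float, float, float]

    def validate(self) -> None:
        """Raise ValueError if the SMPL vectors have unexpected lengths."""
        if len(self.smpl_pose) != SMPL_POSE_DIM:
            raise ValueError(f"expected {SMPL_POSE_DIM} pose values")
        if len(self.smpl_shape) != SMPL_SHAPE_DIM:
            raise ValueError(f"expected {SMPL_SHAPE_DIM} shape values")


@dataclass
class Sequence:
    """A recorded sequence with its (e) action label and (f) subject attributes."""
    action_label: str
    subject: SubjectAttributes
    frames: List[Frame]


# Usage with placeholder values (not real dataset content).
frame = Frame(
    stereo_image_paths=("left.png", "right.png"),
    imu_signals={"sensor_0": [0.0] * 6},
    keypoints_2d={"left": [(0.0, 0.0)], "right": [(0.0, 0.0)]},
    smpl_pose=[0.0] * SMPL_POSE_DIM,
    smpl_shape=[0.0] * SMPL_SHAPE_DIM,
    smpl_translation=(0.0, 0.0, 0.0),
)
frame.validate()
sequence = Sequence("placeholder_action", SubjectAttributes(170.0, 22.0, "t-shirt"), [frame])
```

Keeping the action label and subject attributes on the sequence rather than on each frame mirrors the caption, which describes them as per-sequence annotations.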