EMHI: A Multimodal Egocentric Human Motion Dataset with HMD and Body-Worn IMUs
Zhen Fan, Peng Dai, Zhuo Su, Xu Gao, Zheng Lv, Jiarui Zhang, Tianyuan Du, Guidong Wang, Yang Zhang
TL;DR
This work tackles the challenge of accurate egocentric human pose estimation by marrying real-world stereo egocentric vision with body-worn IMUs. It introduces EMHI, a large-scale multimodal dataset captured on actual VR hardware, providing synchronized imagery, IMU signals, and SMPL ground-truth across diverse subjects, actions, and lighting. It also proposes MEPoser, a baseline that fuses multimodal inputs with a temporal encoder and SMPL decoder to achieve real-time, improved pose estimation on an HMD, outperforming single-modal approaches. Together, EMHI and MEPoser enable more robust, deployable egocentric HPE for VR/AR applications and facilitate future multimodal research on wearable sensors.
Abstract
Egocentric human pose estimation (HPE) using wearable sensors is essential for VR/AR applications. Most methods rely solely on either egocentric-view images or sparse Inertial Measurement Unit (IMU) signals, leading to inaccuracies due to self-occlusion in images or the sparseness and drift of inertial sensors. Most importantly, the lack of real-world datasets containing both modalities is a major obstacle to progress in this field. To overcome the barrier, we propose EMHI, a multimodal \textbf{E}gocentric human \textbf{M}otion dataset with \textbf{H}ead-Mounted Display (HMD) and body-worn \textbf{I}MUs, with all data collected under the real VR product suite. Specifically, EMHI provides synchronized stereo images from downward-sloping cameras on the headset and IMU data from body-worn sensors, along with pose annotations in SMPL format. This dataset consists of 885 sequences captured by 58 subjects performing 39 actions, totaling about 28.5 hours of recording. We evaluate the annotations by comparing them with optical marker-based SMPL fitting results. To substantiate the reliability of our dataset, we introduce MEPoser, a new baseline method for multimodal egocentric HPE, which employs a multimodal fusion encoder, temporal feature encoder, and MLP-based regression heads. The experiments on EMHI show that MEPoser outperforms existing single-modal methods and demonstrates the value of our dataset in solving the problem of egocentric HPE. We believe the release of EMHI and the method could advance the research of egocentric HPE and expedite the practical implementation of this technology in VR/AR products.
