HiSC4D: Human-centered interaction and 4D Scene Capture in Large-scale Space Using Wearable IMUs and LiDAR
Yudi Dai, Zhiyong Wang, Xiping Lin, Chenglu Wen, Lan Xu, Siqi Shen, Yuexin Ma, Cheng Wang
TL;DR
HiSC4D addresses the challenge of capturing long-term, large-scale human interactions and dynamic scenes from an egocentric perspective by fusing body-mounted IMUs with a head-mounted LiDAR and a multi-stage joint optimization framework. The method combines LiDAR-inertial SLAM, dual-person pose processing, and scene-aware physical constraints to produce accurate global motions and scene reconstructions, while mitigating IMU drift without external maps. A dedicated HiSC4D dataset of eight sequences across four large environments, with SMPL annotations and dense scene meshes, provides a valuable benchmark for LiDAR-based 3D human pose estimation from an egocentric view and for social interaction analysis. The results show substantial improvements over IMU-only baselines, enhanced global localization, and plausible interactions in real-world settings, with practical implications for AR/VR, robotics, and autonomous systems.
Abstract
We introduce HiSC4D, a novel Human-centered interaction and 4D Scene Capture method that aims to accurately and efficiently create a dynamic digital world containing large-scale indoor-outdoor scenes, diverse human motions, rich human-human interactions, and human-environment interactions. By utilizing body-mounted IMUs and a head-mounted LiDAR, HiSC4D can capture egocentric human motions in unconstrained space without the need for external devices or pre-built maps. This affords great flexibility and accessibility for human-centered interaction and 4D scene capture in various environments. IMUs can capture human poses without spatial restrictions but are prone to drift over long periods of use, whereas LiDAR is stable for global localization but coarse for local positions and orientations. HiSC4D therefore employs a joint optimization method that harmonizes all sensors and utilizes environment cues, yielding promising results for long-term capture in large scenes. To promote research on egocentric human interaction in large scenes and to facilitate downstream tasks, we also present a dataset containing 8 sequences in 4 large scenes (200 to 5,000 m$^2$), providing 36k frames of accurate 4D human motions with SMPL annotations and dynamic scenes, 31k frames of cropped human point clouds, and scene meshes of the environments. A variety of scenarios, such as a basketball gym and a commercial street, alongside challenging human motions, such as daily greeting, one-on-one basketball playing, and tour guiding, demonstrate the effectiveness and generalization ability of HiSC4D. The dataset and code will be made publicly available for research purposes at www.lidarhumanmotion.net/hisc4d.
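The complementary behavior of the two sensor types described in the abstract can be illustrated with a toy example. The sketch below is not the HiSC4D optimization, which operates on full-body poses, SLAM trajectories, and scene constraints. It is a minimal 1D least-squares fusion under assumed conditions: IMU-like increments that are smooth but biased (so they drift when integrated) and LiDAR-like absolute positions that are drift-free but noisy. The function names `fuse_1d`, `rmse`, and `synthetic_demo`, the weights `w_imu` and `w_lidar`, and all noise levels are hypothetical choices made for illustration.

```python
import math
import random


def fuse_1d(imu_deltas, lidar_positions, w_imu=10.0, w_lidar=1.0):
    """Toy fusion of relative (IMU-like) and absolute (LiDAR-like) measurements.

    Minimizes
        w_imu   * sum_t (x[t+1] - x[t] - imu_deltas[t])^2
      + w_lidar * sum_t (x[t] - lidar_positions[t])^2
    by solving the tridiagonal normal equations with the Thomas algorithm.
    """
    n = len(lidar_positions)
    if len(imu_deltas) != n - 1:
        raise ValueError("expected one IMU delta per consecutive frame pair")

    # Absolute (LiDAR-like) terms contribute to the diagonal and right-hand side.
    diag = [w_lidar] * n
    rhs = [w_lidar * p for p in lidar_positions]

    # Each relative (IMU-like) term couples x[t] and x[t+1].
    for t, d in enumerate(imu_deltas):
        diag[t] += w_imu
        diag[t + 1] += w_imu
        rhs[t] -= w_imu * d
        rhs[t + 1] += w_imu * d
    off = -w_imu  # constant sub- and super-diagonal entry

    # Thomas algorithm: forward elimination, then back substitution.
    c_prime = [0.0] * n
    r_prime = [0.0] * n
    c_prime[0] = off / diag[0]
    r_prime[0] = rhs[0] / diag[0]
    for t in range(1, n):
        m = diag[t] - off * c_prime[t - 1]
        c_prime[t] = off / m
        r_prime[t] = (rhs[t] - off * r_prime[t - 1]) / m

    x = [0.0] * n
    x[-1] = r_prime[-1]
    for t in range(n - 2, -1, -1):
        x[t] = r_prime[t] - c_prime[t] * x[t + 1]
    return x


def rmse(a, b):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))


def synthetic_demo(n=500, seed=0):
    """Compare IMU-only, LiDAR-only, and fused estimates on synthetic data."""
    rng = random.Random(seed)
    truth = [0.1 * t for t in range(n)]

    # Assumed IMU-like increments: low noise plus a constant bias (drift source).
    deltas = [0.1 + 0.01 + rng.gauss(0.0, 0.001) for _ in range(n - 1)]
    # Assumed LiDAR-like positions: no drift, but noisy from frame to frame.
    lidar = [x + rng.gauss(0.0, 0.2) for x in truth]

    # Dead reckoning: integrate the increments from a known start.
    imu_only = [0.0]
    for d in deltas:
        imu_only.append(imu_only[-1] + d)

    fused = fuse_1d(deltas, lidar)
    return {
        "imu_only": rmse(imu_only, truth),
        "lidar_only": rmse(lidar, truth),
        "fused": rmse(fused, truth),
    }


if __name__ == "__main__":
    for name, err in synthetic_demo().items():
        print(f"{name}: RMSE = {err:.3f}")
```

In this synthetic setting the integrated increments drift away from the true trajectory, the absolute positions stay centered on it but jitter, and the fused estimate uses the increments for local smoothness and the absolute positions to anchor the global trajectory. The relative weights control that trade-off and would need tuning for any real sensor pair.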
