HiSC4D: Human-centered interaction and 4D Scene Capture in Large-scale Space Using Wearable IMUs and LiDAR

Yudi Dai, Zhiyong Wang, Xiping Lin, Chenglu Wen, Lan Xu, Siqi Shen, Yuexin Ma, Cheng Wang

TL;DR

HiSC4D addresses the challenge of capturing long-term, large-scale human interactions and dynamic scenes from an egocentric perspective by fusing body-mounted IMUs with a head-mounted LiDAR and a multi-stage joint optimization framework. The method combines LiDAR-inertial SLAM, dual-person pose processing, and scene-aware physical constraints to produce accurate global motions and scene reconstructions, while mitigating IMU drift without external maps. A dedicated HiSC4D dataset of eight sequences across four large environments, with SMPL annotations and dense scene meshes, provides a valuable benchmark for LiDAR-based 3D human pose estimation from an egocentric view and for social interaction analysis. The results show substantial improvements over IMU-only baselines, enhanced global localization, and plausible interactions in real-world settings, with practical implications for AR/VR, robotics, and autonomous systems.

Abstract

We introduce HiSC4D, a novel Human-centered interaction and 4D Scene Capture method that aims to accurately and efficiently create a dynamic digital world containing large-scale indoor-outdoor scenes, diverse human motions, rich human-human interactions, and human-environment interactions. Using body-mounted IMUs and a head-mounted LiDAR, HiSC4D captures egocentric human motions in unconstrained space without external devices or pre-built maps. This affords great flexibility and accessibility for human-centered interaction and 4D scene capture in various environments. IMUs can capture spatially unrestricted human poses but are prone to drift over long periods of use, while LiDAR is stable for global localization but coarse for local positions and orientations. HiSC4D therefore employs a joint optimization method that harmonizes all sensors and exploits environmental cues, yielding promising results for long-term capture in large scenes. To promote research on egocentric human interaction in large scenes and to facilitate downstream tasks, we also present a dataset containing 8 sequences in 4 large scenes (200 to 5,000 $m^2$), providing 36k frames of accurate 4D human motions with SMPL annotations and dynamic scenes, 31k frames of cropped human point clouds, and scene meshes of the environments. A variety of scenarios, such as a basketball gym and a commercial street, alongside challenging human motions, such as daily greeting, one-on-one basketball playing, and tour guiding, demonstrate the effectiveness and generalization ability of HiSC4D. The dataset and code will be made publicly available for research purposes at www.lidarhumanmotion.net/hisc4d.

Paper Structure (21 sections, 27 equations, 10 figures, 5 tables)

This paper contains 21 sections, 27 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: HiSC4D enables the capture of interacting human motions and large-scale scenes with a person equipped with a head-mounted LiDAR and both the first and the second person wearing IMUs. The LiDAR not only scans the surrounding environment but also captures detailed 3D annotations of the second person, ensuring accurate spatial information.
  • Figure 2: The pipeline of HiSC4D. The HiSC4D pipeline consists of LiDAR mapping, first-person motion processing, second-person motion processing, and a multi-stage joint optimization process. This comprehensive pipeline enables the capture and reconstruction of the dynamics of two humans and the scene, resulting in accurate localization and natural human interactions in large-scale environments.
  • Figure 3: The multi-stage optimization pipeline for dual-person motions. Stage 1: optimizing the global translation $T$ only. Stage 2: optimizing $T$ and the global rotation $R$. Stage 3: optimizing $T$, $R$, and $\theta$. The pipeline takes $P_{1:n}$, initial human motions $M_{1:n}$, and the scene $S$ as input, and outputs accurate and scene-plausible motions in large environments.
  • Figure 4: The vertices-to-point constraint pipeline. First, we remove the vertices of the second person (red) that are invisible from the LiDAR view, then resample the visible SMPL vertices to match the LiDAR's resolution. Finally, we apply $\mathcal{L}_{v2p}$ to the second person.
  • Figure 5: Design of the capturing system: Both individuals are equipped with 17 body-attached IMU sensors. The first person also wears a head-mounted LiDAR sensor and a backpack, which houses receivers for both sets of IMUs, a mini-computer, and a mobile power bank.
  • ...and 5 more figures
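
The staged schedule in the Figure 3 caption (translation first, then translation and rotation, then pose as well) can be illustrated with a small toy sketch. Everything here is an assumption made for illustration, not the paper's implementation: `pose_vertices` is a hypothetical stand-in for SMPL with a single "bend" angle playing the role of $\theta$, the global rotation is restricted to yaw, `v2p_loss` is written as a one-sided nearest-neighbor squared distance (one plausible reading of a vertices-to-point term), and the optimizer is finite-difference gradient descent with step halving. Visibility culling, resampling to LiDAR resolution, and scene constraints are omitted.

```python
import numpy as np

def pose_vertices(template, params):
    """Toy body model (hypothetical, not SMPL): params = [tx, ty, tz, yaw, bend]."""
    t, yaw, bend = params[:3], params[3], params[4]
    v = template.copy()
    # "Pose": rotate vertices above z = 0.5 about the x-axis passing through z = 0.5.
    upper = v[:, 2] > 0.5
    c, s = np.cos(bend), np.sin(bend)
    y, z = v[upper, 1], v[upper, 2] - 0.5
    v[upper, 1] = c * y - s * z
    v[upper, 2] = s * y + c * z + 0.5
    # Global rotation (yaw about z only), then global translation.
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return v @ rot.T + t

def v2p_loss(vertices, points):
    """Mean squared distance from each vertex to its nearest point (assumed form)."""
    d2 = ((vertices[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    return float(d2.min(axis=1).mean())

def numeric_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

def fit_staged(template, points, init, iters=30):
    """Three stages: T only, then T and yaw, then T, yaw, and bend."""
    stages = [
        np.array([1, 1, 1, 0, 0], dtype=float),  # stage 1: translation
        np.array([1, 1, 1, 1, 0], dtype=float),  # stage 2: + global rotation
        np.array([1, 1, 1, 1, 1], dtype=float),  # stage 3: + pose
    ]

    def f(p):
        return v2p_loss(pose_vertices(template, p), points)

    params = np.asarray(init, dtype=float).copy()
    history = [params.copy()]
    for mask in stages:
        lr = 0.5
        for _ in range(iters):
            g = numeric_grad(f, params) * mask  # frozen parameters get zero gradient
            trial = params - lr * g
            if f(trial) < f(params):
                params = trial                   # accept only improving steps
            else:
                lr *= 0.5                        # otherwise shrink the step
        history.append(params.copy())
    return params, history

# Usage: synthesize "LiDAR points" from a known configuration, then fit from zeros.
rng = np.random.default_rng(0)
template = rng.uniform([-0.2, -0.1, 0.0], [0.2, 0.1, 1.0], size=(60, 3))
target = pose_vertices(template, np.array([0.3, -0.2, 0.1, 0.2, 0.3]))
fitted, history = fit_staged(template, target, np.zeros(5))
```

Freezing rotation and pose in the early stages keeps the first updates from compensating for a translation error with an implausible pose change, which is the usual motivation for coarse-to-fine schedules of this kind. Because steps are accepted only when they lower the loss, the loss never increases from one stage to the next.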