LaserHuman: Language-guided Scene-aware Human Motion Generation in Free Environment
Peishan Cong, Ziyi Wang, Zhiyang Dou, Yiming Ren, Wei Yin, Kai Cheng, Yujing Sun, Xiaoxiao Long, Xinge Zhu, Yuexin Ma
TL;DR
LaserHuman tackles the Scene-Text-to-Motion challenge by introducing a real-world, multi-modal dataset of 3D scenes and free-form language, paired with a diffusion-based model that fuses scene and language cues to generate plausible human motions. The method employs a multi-condition fusion module that integrates point-cloud scene features and CLIP language representations to guide a diffusion process over SMPL pose sequences, achieving state-of-the-art performance on LaserHuman and existing benchmarks. Extensive experiments and user studies demonstrate improvements in textual alignment, spatial plausibility, and motion diversity across indoor and outdoor dynamic environments, with ablations confirming the value of parallel cross-modal fusion. The work advances data and methodology for practical applications in simulation, animation, robotics, and human-scene interaction research, while also outlining avenues for improved physical fidelity and dynamic scene handling.
Abstract
Language-guided, scene-aware human motion generation is important for entertainment and robotics. To address the limitations of existing datasets, we introduce LaserHuman, a dataset for Scene-Text-to-Motion research. LaserHuman includes real human motions captured within 3D environments, free-form natural language descriptions, both indoor and outdoor scenarios, and dynamic scenes. Its diverse capture modalities and rich annotations open opportunities for research on conditional motion generation and can facilitate the development of real-life applications. Moreover, to generate semantically consistent and physically plausible human motions, we propose a simple but effective multi-conditional diffusion model that achieves state-of-the-art performance on existing datasets.
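
The summary describes a fusion module that conditions motion generation on scene and language features in parallel. The NumPy sketch below illustrates one way such parallel conditioning could look: the same motion tokens attend separately to scene tokens and text tokens, and the two results are added back residually. It is an illustrative sketch only, not the authors' implementation. The function names, tensor sizes, single-head attention, and additive combination are all assumptions, and random arrays stand in for real point-cloud features, CLIP text embeddings, and noisy SMPL pose tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: queries (T, D) attend to context (N, D)."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # shape (T, N)
    return weights @ v, weights

def parallel_fusion(motion_tokens, scene_tokens, text_tokens, params):
    """Hypothetical parallel fusion: both branches read the same motion tokens."""
    scene_out, scene_w = cross_attention(motion_tokens, scene_tokens, *params["scene"])
    text_out, text_w = cross_attention(motion_tokens, text_tokens, *params["text"])
    # Assumption: the two condition branches are combined by a residual sum.
    fused = motion_tokens + scene_out + text_out
    return fused, scene_w, text_w

# Assumed toy sizes: T motion frames, N scene points, M text tokens, D channels.
rng = np.random.default_rng(0)
T, N, M, D = 8, 32, 6, 16
motion_tokens = rng.normal(size=(T, D))  # stand-in for noisy pose-sequence tokens
scene_tokens = rng.normal(size=(N, D))   # stand-in for point-cloud scene features
text_tokens = rng.normal(size=(M, D))    # stand-in for language features

# Random projection matrices (w_q, w_k, w_v) for each condition branch.
params = {
    name: tuple(rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))
    for name in ("scene", "text")
}

fused, scene_w, text_w = parallel_fusion(motion_tokens, scene_tokens, text_tokens, params)
```

In a diffusion model, a block like this would sit inside the denoiser and be applied at every denoising step, so that each update of the pose sequence sees both conditions. Because the branches are computed from the same input rather than one after the other, neither condition is filtered through the other, which is the property a parallel design is meant to preserve.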
