LaserHuman: Language-guided Scene-aware Human Motion Generation in Free Environment

Peishan Cong, Ziyi Wang, Zhiyang Dou, Yiming Ren, Wei Yin, Kai Cheng, Yujing Sun, Xiaoxiao Long, Xinge Zhu, Yuexin Ma

TL;DR

LaserHuman tackles the Scene-Text-to-Motion challenge by introducing a real-world, multi-modal dataset of 3D scenes and free-form language, paired with a diffusion-based model that fuses scene and language cues to generate plausible human motions. The method employs a multi-condition fusion module that integrates point-cloud scene features and CLIP language representations to guide a diffusion process over SMPL pose sequences, achieving state-of-the-art performance on LaserHuman and existing benchmarks. Extensive experiments and user studies demonstrate improvements in textual alignment, spatial plausibility, and motion diversity across indoor and outdoor dynamic environments, with ablations confirming the value of parallel cross-modal fusion. The work advances data and methodology for practical applications in simulation, animation, robotics, and human-scene interaction research, while also outlining avenues for improved physical fidelity and dynamic scene handling.
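The parallel fusion idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`cross_attention`, `parallel_fusion`), the token counts, the feature width, and the residual sum used to merge the two branches are all assumptions made for illustration, and random arrays stand in for the point-cloud scene features and CLIP text features. A real model would also use learned projections and run this step inside each denoising iteration of the diffusion process.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head scaled dot-product attention, without learned projections.

    queries: (T, d) motion tokens; context: (N, d) condition tokens.
    Returns a (T, d) array: each query becomes a weighted mix of context rows.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)   # (T, N) similarity scores
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ context                    # (T, d)

def parallel_fusion(motion_tokens, scene_tokens, text_tokens):
    """Hypothetical parallel fusion: the motion tokens attend to scene and
    text features independently, and both results are added back."""
    scene_branch = cross_attention(motion_tokens, scene_tokens)
    text_branch = cross_attention(motion_tokens, text_tokens)
    return motion_tokens + scene_branch + text_branch

# Placeholder inputs (assumed sizes): 8 motion frames, 32 scene tokens,
# 4 text tokens, feature width 16.
rng = np.random.default_rng(0)
motion = rng.standard_normal((8, 16))
scene = rng.standard_normal((32, 16))
text = rng.standard_normal((4, 16))

fused = parallel_fusion(motion, scene, text)
print(fused.shape)  # (8, 16)
```

Because the two branches are computed independently, neither condition is filtered through the other before it reaches the motion tokens, which is one plausible reading of "parallel" as opposed to sequential fusion.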

Abstract

Language-guided scene-aware human motion generation has great significance for entertainment and robotics. In response to the limitations of existing datasets, we introduce LaserHuman, a pioneering dataset engineered to revolutionize Scene-Text-to-Motion research. LaserHuman stands out with its inclusion of genuine human motions within 3D environments, unbounded free-form natural language descriptions, a blend of indoor and outdoor scenarios, and dynamic, ever-changing scenes. Diverse modalities of capture data and rich annotations present great opportunities for the research of conditional motion generation, and can also facilitate the development of real-life applications. Moreover, to generate semantically consistent and physically plausible human motions, we propose a multi-conditional diffusion model, which is simple but effective, achieving state-of-the-art performance on existing datasets.

Paper Structure

This paper contains 26 sections, 4 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 2: Overview of data collection and processing procedures.
  • Figure 3: A gallery of LaserHuman, containing diverse scenarios, rich interactions, and abundant free-form language descriptions, where colored humans are annotated targets and white humans are interacted humans in the dynamic scene.
  • Figure 4: The pipeline of our generative model, which is applicable for language-guided scene-aware human motion generation. We demonstrate details of the multi-condition fusion module.
  • Figure 5: Generation results on LaserHuman. The human mesh color goes from light to dark as time progresses, and pink humans are the corresponding interacted humans in the dynamic scene.
  • Figure 6: More generation results of our method on LaserHuman.
  • ...and 4 more figures