SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance

Peishan Cong, Ziyi Wang, Yuexin Ma, Xiangyu Yue

TL;DR

SemGeoMo tackles dynamic contextual human motion generation by integrating textual semantic guidance with hierarchical geometric cues from sequential point clouds. The approach introduces an automated LLM Annotator to generate coarse and fine textual guidance, a SemGeo Hierarchical Guidance stage with dual-branch transformers to produce coarse hand-joint and affordance cues, and a SemGeo-guided Motion Generation stage using Motion ControlNet for high-fidelity full-body motions. Multi-level conditioning combines text features (via CLIP/LONGCLIP) with geometry features (joint positions and affordances) through a SemGeo Condition Module and mutual cross-attention, followed by loss-guided refinement and L-BFGS-based posterior updates. Experiments across FullBodyManipulation, BEHAVE, IMHD$^2$, and unseen HoDome demonstrate state-of-the-art performance and strong generalization to unseen objects and human–human interactions, while ablations confirm the value of textual guidance and coarse-to-fine geometric cues. The work offers a practical framework for realistic interactive motion generation with interpretable language descriptions, enabling improved human–robot interaction and immersive simulations.
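The mutual cross-attention mentioned above can be illustrated with a minimal sketch, assuming plain scaled dot-product attention with no learned projections and a simple residual sum. The function names, the toy token values, and the feature size are hypothetical; the actual SemGeo Condition Module may be built quite differently.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention where keys and values are the same vectors.

    Assumption: no learned query/key/value projections, to keep the sketch short.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def mutual_cross_attention(text_tokens, geo_tokens):
    """Text attends to geometry, geometry attends to text, each with a residual add."""
    text_from_geo = cross_attention(text_tokens, geo_tokens)
    geo_from_text = cross_attention(geo_tokens, text_tokens)
    text_out = [
        [t + a for t, a in zip(tok, att)]
        for tok, att in zip(text_tokens, text_from_geo)
    ]
    geo_out = [
        [g + a for g, a in zip(tok, att)]
        for tok, att in zip(geo_tokens, geo_from_text)
    ]
    return text_out, geo_out


# Toy inputs: 2 text tokens and 3 geometry tokens, each with 4 features.
text = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
geo = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
text_out, geo_out = mutual_cross_attention(text, geo)
print(len(text_out), len(text_out[0]))
print(len(geo_out), len(geo_out[0]))
```

Each stream keeps its own token count and feature size, so the fused text and geometry features can still be passed on separately as conditions to a downstream generator.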

Abstract

Generating reasonable and high-quality human interactive motions in a given dynamic environment is crucial for understanding, modeling, transferring, and applying human behaviors to both virtual and physical robots. In this paper, we introduce an effective method, SemGeoMo, for dynamic contextual human motion generation, which fully leverages the text-affordance-joint multi-level semantic and geometric guidance in the generation process, improving the semantic rationality and geometric correctness of generative motions. Our method achieves state-of-the-art performance on three datasets and demonstrates superior generalization capability for diverse interaction scenarios. The project page and code can be found at https://4dvlab.github.io/project_page/semgeomo/.

Paper Structure

This paper contains 24 sections, 7 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: Given sequential point clouds of interactive targets, SemGeoMo generates realistic and high-quality human interactive motions along with corresponding textual descriptions. By leveraging both semantic and geometric guidance, our method ensures the semantic coherence and geometric accuracy of the generated results.
  • Figure 2: The pipeline of our two-stage framework. The LLM Annotator provides the semantic guidance. SemGeo Hierarchical Guidance Generation takes textual information and the sequential point cloud as conditions and generates affordance-level and joint-level guidance. SemGeo-guided Motion Generation then utilizes the semantic and geometric information to generate responsive human motion.
  • Figure 3: LLM Annotator pipeline. It first takes a sequential point cloud and infers a coarse text description. It then uses joint positions, the coarse text, and geometric features to generate a fine-grained sentence with a designed prompt.
  • Figure 4: Qualitative results on the FullBodyManipulation dataset. We circle areas of low contact performance in pink and instances of contorted motion in green.
  • Figure 5: Visualization on more datasets.
  • ...and 4 more figures