Jointly Understand Your Command and Intention: Reciprocal Co-Evolution between Scene-Aware 3D Human Motion Synthesis and Analysis

Xuehao Gao, Yang Yang, Shaoyi Du, Guo-Jun Qi, Junwei Han

TL;DR

This work addresses the problem of producing realistic, text-conditioned 3D human motions within complex indoor scenes while also enabling semantic analysis of those motions. It introduces CESA, a co-evolving synthesis-analysis pipeline that couples a cascaded goal–path–pose motion generator with a scene-aware motion analyzer, trained jointly to mutually improve synthesis realism and semantic fidelity. Through extensive experiments on multiple datasets, CESA demonstrates significant improvements in motion realism, text-motion consistency, and scene compatibility, with strong qualitative results and robust ablations validating the cascaded design and mutual learning. The approach offers a practical pathway to more controllable and semantically grounded human motion generation for VR, animation, and human-robot interaction applications.

Abstract

As two closely coupled, reciprocal tasks, scene-aware human motion synthesis and analysis require a joint understanding of multiple modalities, including 3D body motions, 3D scenes, and textual descriptions. In this paper, we integrate these two paired processes into a Co-Evolving Synthesis-Analysis (CESA) pipeline so that each benefits the other's learning. Specifically, scene-aware text-to-human synthesis generates diverse indoor motion samples from the same textual description to enrich the intra-class diversity of human-scene interactions, which helps train a robust human motion analysis system. Reciprocally, human motion analysis enforces semantic scrutiny on each synthesized motion sample to ensure its consistency with the given textual description, which improves the realism of motion synthesis. Considering that real-world indoor human motions are goal-oriented and path-guided, we propose a cascaded generation strategy that factorizes text-driven, scene-specific human motion generation into three stages: goal inferring, path planning, and pose synthesizing. Coupling CESA with this cascaded motion synthesis model, we jointly improve realistic human motion synthesis and robust human motion analysis in 3D scenes.
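The three-stage factorization described in the abstract (goal inferring, then path planning, then pose synthesizing) can be illustrated with a toy sketch. This is not the authors' model: every function name, the dictionary-based scene format, the noisy straight-line path, and the 22-joint placeholder skeleton are assumptions chosen only to show how each stage conditions on the output of the previous one.

```python
import random

def infer_goal(text, scene_objects, rng):
    """Stage 1 (toy): pick a scene object whose name appears in the command."""
    candidates = [obj for obj in scene_objects if obj["name"] in text]
    # Fall back to any object if the command names none of them.
    return rng.choice(candidates or scene_objects)

def plan_path(start, goal_xy, num_steps, rng, noise=0.05):
    """Stage 2 (toy): noisy straight line from the start to the goal position."""
    path = []
    for i in range(num_steps + 1):
        t = i / num_steps
        x = start[0] + t * (goal_xy[0] - start[0])
        y = start[1] + t * (goal_xy[1] - start[1])
        # Jitter interior waypoints only, so the endpoints stay exact.
        if 0 < i < num_steps:
            x += rng.gauss(0.0, noise)
            y += rng.gauss(0.0, noise)
        path.append((x, y))
    return path

def synthesize_poses(path, num_joints=22):
    """Stage 3 (toy): place a placeholder skeleton at every waypoint."""
    return [[(x, y, 0.1 * j) for j in range(num_joints)] for (x, y) in path]

def generate_motion(text, scene_objects, start, seed=0, num_steps=10):
    """Run the cascade: goal -> path -> poses, sharing one random source."""
    rng = random.Random(seed)
    goal = infer_goal(text, scene_objects, rng)
    path = plan_path(start, goal["xy"], num_steps, rng)
    poses = synthesize_poses(path)
    return goal, path, poses

# Hypothetical scene with two objects and their floor-plane positions.
scene = [{"name": "sofa", "xy": (2.0, 1.0)}, {"name": "table", "xy": (0.0, 3.0)}]
goal, path, poses = generate_motion("sit on the sofa", scene, start=(0.0, 0.0))
```

Changing `seed` yields a different path to the same goal, which mirrors the idea that one command can map to many valid motions in the same scene.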

Paper Structure

This paper contains 34 sections, 13 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Co-evolving Synthesis-Analysis Pipeline. Scene-aware text-to-motion generation synthesizes 3D indoor human poses conditioned on the given text commands and 3D scenes. Given a synthesized human-scene interaction sample, scene-aware motion analysis recognizes its human action and interaction object categories.
  • Figure 2: Non-one-to-one correspondence in text-to-motion synthesis in 3D scenes. In some scene contexts, more than one indoor goal/path/pose may conform to the description in the given textual command.
  • Figure 3: Core Components. Given the encoded text feature $\boldsymbol{f}_{T}$ and scene feature $\boldsymbol{f}_{S}$, the text-to-motion generator parameterizes a set of Gaussian latent spaces from $\boldsymbol{f}_{T}$ and $\boldsymbol{f}_{S}$ for non-deterministic path planning and pose synthesizing. Then, the motion analyzer extracts latent features from a synthesized human-scene interaction sample to recognize its action category and interaction object.
  • Figure 4: Qualitative Comparison. We visualize four human-scene interaction samples synthesized from each text-to-motion generation method. Dotted boxes indicate the imperfections reflected in poor motion naturalness, unrealistic human-scene interactions, and mistaken interaction objects. The color of the pose deepens over time.
  • Figure 5: Motion synthesis and analysis results conditioned on the given text inputs. Correctly and incorrectly inferred ACT and OBJ results are shown in green and red, respectively.
  • ...and 7 more figures
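
The Figure 3 caption mentions Gaussian latent spaces parameterized from the text and scene features to make generation non-deterministic. A minimal sketch of that general idea follows, assuming a standard reparameterized Gaussian sample with linear maps for the mean and log-variance. The weight matrices, feature sizes, and function names are hypothetical and are not taken from the paper.

```python
import math
import random

def linear(weights, features):
    """Apply a bias-free linear map given as a list of weight rows."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]

def gaussian_params(f_text, f_scene, w_mu, w_logvar):
    """Map concatenated text and scene features to a mean and log-variance."""
    features = list(f_text) + list(f_scene)
    return linear(w_mu, features), linear(w_logvar, features)

def sample_latent(mu, logvar, rng):
    """Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

# Hypothetical 2-D text feature, 1-D scene feature, and 2-D latent space.
f_T = [1.0, 2.0]
f_S = [3.0]
W_MU = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
W_LOGVAR = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # log-variance 0, so sigma = 1

mu, logvar = gaussian_params(f_T, f_S, W_MU, W_LOGVAR)
z_a = sample_latent(mu, logvar, random.Random(0))
z_b = sample_latent(mu, logvar, random.Random(1))
```

Here `z_a` and `z_b` are two different latent codes drawn for the same text and scene inputs. Feeding distinct codes to a decoder is one common way to obtain several plausible paths or poses for a single command.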