Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

Zan Wang; Yixin Chen; Baoxiong Jia; Puhao Li; Jinlu Zhang; Jingze Zhang; Tengyu Liu; Yixin Zhu; Wei Liang; Siyuan Huang

TL;DR

This work tackles language-guided human motion generation in 3D scenes under data scarcity by introducing a two-stage diffusion framework that uses scene affordance as an intermediate representation. The Affordance Diffusion Model predicts scene-language grounded affordance maps, which the Affordance-to-Motion Diffusion Model then uses, along with language and scene context, to synthesize plausible motions. Empirical results on HumanML3D and HUMANISE show state-of-the-art performance and strong generalization to unseen descriptions and scenes, validating the affordance-based grounding strategy. The approach advances controllable, semantically coherent motion generation in 3D environments and offers a data-efficient path for multimodal integration in embodied AI tasks.

Abstract

Despite significant advancements in text-to-motion synthesis, generating language-guided human motion within 3D environments poses substantial challenges. These challenges stem primarily from (i) the absence of powerful generative models capable of jointly modeling natural language, 3D scenes, and human motion, and (ii) the generative models' intensive data requirements contrasted with the scarcity of comprehensive, high-quality language-scene-motion datasets. To tackle these issues, we introduce a novel two-stage framework that employs scene affordance as an intermediate representation, effectively linking 3D scene grounding and conditional motion generation. Our framework comprises an Affordance Diffusion Model (ADM) for predicting an explicit affordance map and an Affordance-to-Motion Diffusion Model (AMDM) for generating plausible human motions. By leveraging scene affordance maps, our method overcomes the difficulty of generating human motion under multimodal condition signals, especially when training with limited data lacking extensive language-scene-motion pairs. Our extensive experiments demonstrate that our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE. Additionally, we validate our model's exceptional generalization capabilities on a specially curated evaluation set featuring previously unseen descriptions and scenes.
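
The two-stage data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_denoiser` is a hypothetical placeholder for the learned ADM and AMDM networks, the one-value-per-scene-point affordance layout and the flat conditioning vectors are assumptions made for illustration, and the sampler is a generic DDPM ancestral sampler with a linear noise schedule rather than the schedule used in the paper.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def toy_denoiser(x: Sequence[float], t: int, condition: Sequence[float]) -> Vector:
    """Hypothetical stand-in for a learned noise predictor (not ADM/AMDM)."""
    target = sum(condition) / len(condition) if condition else 0.0
    return [xi - target for xi in x]


def ddpm_sample(dim: int, condition: Sequence[float],
                num_steps: int = 50, seed: int = 0) -> Vector:
    """Generic DDPM ancestral sampling with a linear beta schedule."""
    rng = random.Random(seed)
    betas = [1e-4 + (0.02 - 1e-4) * i / (num_steps - 1) for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)

    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    for t in reversed(range(num_steps)):
        eps = toy_denoiser(x, t, condition)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the final step
    return x


def generate_motion(scene_points: Sequence[Sequence[float]],
                    text_embedding: Sequence[float],
                    num_frames: int, pose_dim: int) -> Tuple[Vector, List[Vector]]:
    """Stage 1 predicts an affordance map; stage 2 generates motion from it."""
    scene_flat = [c for point in scene_points for c in point]

    # Stage 1 (ADM role): affordance conditioned on scene and language.
    # Assumption: one scalar affordance value per scene point.
    affordance = ddpm_sample(len(scene_points), scene_flat + list(text_embedding))

    # Stage 2 (AMDM role): motion conditioned on affordance, language, and scene.
    motion_condition = affordance + list(text_embedding) + scene_flat
    flat = ddpm_sample(num_frames * pose_dim, motion_condition, seed=1)
    motion = [flat[i * pose_dim:(i + 1) * pose_dim] for i in range(num_frames)]
    return affordance, motion
```

The point of the sketch is the interface between the stages: the second sampler never sees the language-scene pair alone, but always together with the affordance map produced by the first sampler, which is what allows the intermediate representation to carry the grounding.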

Paper Structure (48 sections, 6 equations, 9 figures, 7 tables)

Introduction
Related Work
Language, Human Motion, and 3D Scene
Conditional Human Motion Generation
Scene Affordance
Preliminaries
Diffusion Model
Problem Definition
Method
Affordance Map
Affordance Diffusion Model
Affordance-to-Motion Diffusion Model
Implementation Details
Experiments
Datasets
...and 33 more sections

Figures (9)

Figure 1: Language-guided human motion generation in 3D scenes via scene affordance. Employing scene affordance as an intermediate representation enhances motion generation capabilities on benchmarks (a) HumanML3D and (b) HUMANISE, and significantly boosts the model's ability to generalize to (c) unseen scenarios.
Figure 2: Overview of our method. To generate language-guided human motions in 3D scenes, our framework first predicts the scene affordance map in accordance with the language description using ADM. Next, it generates interactive human motions with AMDM conditioned on the predicted affordance map.
Figure 3: Qualitative results on HUMANISE dataset. The bottom-right figure provides a top-down view. Zoom in for better visualization.
Figure 4: Qualitative comparisons on generalization evaluation set. The first row is generated by the one-stage diffusion model and the second row is generated by our model. Our method can generate natural and accurately grounded human motions in unseen 3D scenes.
Figure 5: Failure cases. Our model fails when facing entirely unfamiliar human-scene interactions (HSI) or overly complex descriptions.
...and 4 more figures
