Generating Human Motion in 3D Scenes from Text Descriptions

Zhi Cen, Huaijin Pi, Sida Peng, Zehong Shen, Minghui Yang, Shuai Zhu, Hujun Bao, Xiaowei Zhou

TL;DR

This work tackles generating human motions in 3D indoor scenes from text by decomposing the problem into language-grounded object localization and object-focused motion synthesis. It uses large language models to ground the target object via a two-stage prompting strategy on scene graphs, then employs object-centric volumetric sensors and diffusion models to generate trajectories and local motions conditioned on text. The method outperforms baselines on the HUMANISE dataset across scene alignment, action fidelity, and realism metrics, and shows zero-shot generalization to unseen PROX scenes without fine-tuning. By tightly coupling textual grounding with targeted motion generation, the approach produces more realistic human-scene interactions, with practical benefits for animated content, VR, and robotics in indoor environments.

Abstract

Generating human motions from textual descriptions has gained growing research interest due to its wide range of applications. However, only a few works consider human-scene interactions together with text conditions, even though such interactions are crucial for visual and physical realism. This paper focuses on the task of generating human motions in 3D indoor scenes given text descriptions of the human-scene interactions. This task presents challenges due to the multi-modal nature of text, scene, and motion, as well as the need for spatial reasoning. To address these challenges, we propose a new approach that decomposes the complex problem into two more manageable sub-problems: (1) language grounding of the target object and (2) object-centric motion generation. For language grounding of the target object, we leverage the power of large language models. For motion generation, we design an object-centric scene representation for the generative model to focus on the target object, thereby reducing the scene complexity and facilitating the modeling of the relationship between human motions and the object. Experiments demonstrate that our approach produces higher-quality motions than the baselines and validate our design choices.
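To make the idea of an object-centric scene representation more concrete, the following is a minimal sketch of a volumetric occupancy grid centered on the target object. It is an illustration under stated assumptions, not the authors' implementation: the function name `object_centric_occupancy`, the cubic sensor volume, and the parameters `extent` and `resolution` are all hypothetical.

```python
import numpy as np


def object_centric_occupancy(points, center, extent, resolution):
    """Voxelize scene points inside a cube centered on the target object.

    points:     (N, 3) scene points in world coordinates.
    center:     (3,) center of the target object (hypothetical input).
    extent:     side length of the cubic sensor volume, in meters.
    resolution: number of voxels along each side.
    Returns a (resolution, resolution, resolution) boolean occupancy grid.
    """
    points = np.asarray(points, dtype=float)
    center = np.asarray(center, dtype=float)
    # Shift to object-centric coordinates so the cube's corner is the origin.
    local = points - center + extent / 2.0
    # Convert metric coordinates to integer voxel indices.
    idx = np.floor(local / extent * resolution).astype(int)
    # Discard points that fall outside the sensor volume.
    inside = np.all((idx >= 0) & (idx < resolution), axis=1)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[tuple(idx[inside].T)] = True
    return grid
```

Because everything outside the cube is discarded, a generative model consuming this grid sees only the geometry near the target object, which is one way to read the abstract's claim about reduced scene complexity. Calling the function twice with a small `extent` and a large `extent` would give a fine view and a coarse view of the same object.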

Paper Structure (17 sections, 5 equations, 6 figures, 3 tables)

This paper contains 17 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Generating human motions in 3D scenes from text descriptions. Our method can generate human motions containing accurate human-object interactions in 3D scenes based on textual descriptions. Although our method is trained and tested on the HUMANISE dataset, it can generalize to other scenes, e.g., the scenes in the PROX dataset. Left: test results on the HUMANISE dataset. Right: generalization results on the PROX scenes.
  • Figure 2: Overview of our two-stage pipeline. In the first stage, given an input scene and a text description (a), we use ChatGPT to locate the target object (b). In the second stage, human motions are synthesized by first producing human trajectories (c) and then generating local poses (d).
  • Figure 3: Pipeline for localizing the target object. In stage 1, given the input text description and detected object bounding boxes (bbx), we construct the first prompt, asking ChatGPT for the categories of the target and anchor objects. Based on the response, the scene graph can be simplified. In stage 2, we construct the second prompt from the inputs and results of stage 1, including object relations derived from the simplified scene graph. The second prompt asks ChatGPT to infer the target object. Finally, we obtain the target object's bounding box from ChatGPT's response.
  • Figure 4: The visualization of the environment sensor, target sensor, and trajectory sensor. The target sensor (b) gives detailed geometry of the target object. The environment sensor (c) gives coarse spatial information around the target object. The trajectory sensor (d) is located around the human.
  • Figure 5: Qualitative results. We compare our method with the ground truth and four baselines (please refer to Sec. \ref{sec:compare}) given the same text descriptions. Our method synthesizes motions that interact with the object as precisely as the ground-truth data, while the baselines fail.
  • ...and 1 more figures
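
The two-stage localization described in the Figure 3 caption can be sketched roughly as follows. This is a hedged illustration only: the prompt wording, the reply formats (`target=...; anchor=...` and a bare object id), the distance-based relations, and all function names are assumptions made for the example, and `llm` is a placeholder callable standing in for whatever chat model is queried.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical box format: center (x, y, z) followed by size (dx, dy, dz).
Box = Tuple[float, float, float, float, float, float]
Objects = Dict[str, Tuple[str, Box]]  # object id -> (category, box)


def stage1_prompt(text: str, categories: List[str]) -> str:
    """First prompt: ask for the target and anchor categories."""
    return (
        f"Scene object categories: {', '.join(sorted(set(categories)))}.\n"
        f"Instruction: {text}\n"
        "Reply as 'target=<category>; anchor=<category or none>'."
    )


def parse_stage1(reply: str) -> Tuple[str, str]:
    """Parse the assumed 'target=...; anchor=...' reply format."""
    fields = dict(part.strip().split("=", 1) for part in reply.split(";"))
    return fields["target"].strip(), fields["anchor"].strip()


def simplify(objects: Objects, keep: Tuple[str, str]) -> Objects:
    """Keep only objects whose category is the target or the anchor."""
    return {k: v for k, v in objects.items() if v[0] in keep}


def relations(objects: Objects) -> List[str]:
    """Describe pairwise center distances (a stand-in for scene-graph edges)."""
    ids = sorted(objects)
    lines = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            dist = math.dist(objects[a][1][:3], objects[b][1][:3])
            lines.append(f"{a} is {dist:.1f} m from {b}")
    return lines


def stage2_prompt(text: str, target: str, rels: List[str]) -> str:
    """Second prompt: ask which object instance is the target."""
    return (
        f"Instruction: {text}\nTarget category: {target}\n"
        "Relations:\n" + "\n".join(rels) + "\n"
        "Answer with the id of the target object only."
    )


def ground_target(text: str, objects: Objects, llm: Callable[[str], str]) -> Box:
    """Run both stages and return the bounding box of the chosen object."""
    categories = [category for category, _ in objects.values()]
    target, anchor = parse_stage1(llm(stage1_prompt(text, categories)))
    kept = simplify(objects, (target, anchor))
    reply = llm(stage2_prompt(text, target, relations(kept))).strip()
    return kept[reply][1]
```

Passing `llm` in as a plain callable keeps the sketch independent of any particular API and lets the flow be exercised with a scripted stub. A real system would also need to handle replies that do not follow the requested format, which this sketch does not attempt.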