Generating Human Interaction Motions in Scenes with Text Control

Hongwei Yi, Justus Thies, Michael J. Black, Xue Bin Peng, Davis Rempe

TL;DR

TeSMo addresses text-controlled, scene-aware human motion generation by combining a pre-trained, text-conditioned diffusion model with a scene-aware refinement branch. The method decomposes motion into navigation and interaction. For navigation, a diffusion model conditioned on a 2D floor map generates a pelvis/root trajectory in the scene, and an in-painting model lifts that trajectory to full-body motion. For interaction, a dedicated diffusion model conditioned on object geometry generates the motion. Key contributions include a new Loco-3D-FRONT dataset for scene-aware navigation and data-augmentation strategies that place interactions in varied 3D environments for scene-aware training while preserving text controllability. Experimental results show that TeSMo achieves realism competitive with scene-agnostic diffusion models and improves the plausibility of human-scene interactions, as supported by objective metrics and a user study. Code will be released upon publication.
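The staged decomposition described above can be sketched as a simple data flow. This is only an illustrative outline, not the authors' code: every function name, array shape, joint count, and frame count below is a hypothetical choice, and each learned diffusion or in-painting model is replaced by a trivial stand-in (straight-line interpolation) so that the example runs on its own.

```python
import numpy as np


def generate_root_trajectory(start_xy, goal_xy, floor_map, num_frames=60):
    """Stand-in for the floor-map-conditioned root-trajectory diffusion model.

    Hypothetical placeholder: returns a straight line from start to goal.
    The floor map is accepted to show the conditioning input but is unused.
    """
    t = np.linspace(0.0, 1.0, num_frames)[:, None]
    start = np.asarray(start_xy, dtype=float)
    goal = np.asarray(goal_xy, dtype=float)
    return (1.0 - t) * start + t * goal  # shape: (num_frames, 2)


def inpaint_full_body(root_xy, num_joints=22, pelvis_height=0.9):
    """Stand-in for the in-painting model that lifts a root path to full-body motion.

    Hypothetical placeholder: every joint is placed at the root position
    at a fixed height (y is treated as the up axis here).
    """
    motion = np.zeros((root_xy.shape[0], num_joints, 3))
    motion[:, :, 0] = root_xy[:, None, 0]
    motion[:, :, 1] = pelvis_height
    motion[:, :, 2] = root_xy[:, None, 1]
    return motion  # shape: (num_frames, num_joints, 3)


def generate_interaction(start_pose, object_points, num_frames=30):
    """Stand-in for the object-conditioned interaction diffusion model.

    Hypothetical placeholder: moves all joints linearly from the start pose
    toward the centroid of the object's point cloud.
    """
    target = object_points.mean(axis=0)
    t = np.linspace(0.0, 1.0, num_frames)[:, None, None]
    return (1.0 - t) * start_pose[None] + t * target[None, None]


def generate_motion(start_xy, goal_xy, floor_map, object_points):
    """Chain the stages: root trajectory, full-body in-painting, interaction."""
    root = generate_root_trajectory(start_xy, goal_xy, floor_map)
    navigation = inpaint_full_body(root)
    # The interaction stage starts from the last navigation pose.
    interaction = generate_interaction(navigation[-1], object_points)
    return np.concatenate([navigation, interaction], axis=0)


if __name__ == "__main__":
    floor_map = np.ones((32, 32), dtype=bool)  # hypothetical walkable mask
    chair_points = np.random.default_rng(0).normal(size=(100, 3))
    motion = generate_motion((0.0, 0.0), (2.0, 1.0), floor_map, chair_points)
    print(motion.shape)
```

The point of the sketch is the hand-off between stages: the navigation stage only needs 2D scene context, while the interaction stage receives the final navigation pose together with the target object's geometry.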

Abstract

We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Previous text-to-motion methods focus on characters in isolation without considering scenes due to the limited availability of datasets that include motion, text descriptions, and interactive scenes. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model, emphasizing goal-reaching constraints on large-scale motion-capture datasets. We then enhance this model with a scene-aware component, fine-tuned using data augmented with detailed scene information, including ground plane and object shapes. To facilitate training, we embed annotated navigation and interaction motions within scenes. The proposed method produces realistic and diverse human-object interactions, such as navigation and sitting, in different scenes with various object shapes, orientations, initial body positions, and poses. Extensive experiments demonstrate that our approach surpasses prior techniques in terms of the plausibility of human-scene interactions, as well as the realism and variety of the generated motions. Code will be released upon publication of this work at https://research.nvidia.com/labs/toronto-ai/tesmo.

Paper Structure

This paper contains 38 sections, 1 equation, 8 figures, and 4 tables.

Figures (8)

  • Figure 1: We present TeSMo, a method for generating diverse and plausible human-scene interactions from text input. Given a 3D scene, TeSMo generates scene-aware motions, such as walking in free space and sitting on a chair. Our model can be easily controlled using textual descriptions, start positions, and goal positions.
  • Figure 2: Pipeline overview: given the start position (green arrow), goal position (red arrow), 3D scene, and text description, the navigation root trajectory is first generated and then the full-body motion is completed through in-painting. Subsequently, the interaction is generated from a start pose (i.e., the end pose from navigation), goal position, and the target object, enabling the generation of object-specific motion.
  • Figure 3: Network architecture of the (a) root trajectory model and (b) interaction motion model. Initially, the base transformer encoder is trained on scene-agnostic motion data using start pose, target pose, and text as input. Subsequently, a scene-aware component is fine-tuned, which incorporates the 2D floor map (a) or 3D object (b).
  • Figure 4: (a) Loco-3D-FRONT contains locomotion placed in 3D-FRONT [fu20213d] scenes without collisions. (b) We augment SAMP [21iccv_samp] by randomly selecting chairs from 3D-FRONT to match the motions and annotating a text description for each sub-sequence.
  • Figure 5: Navigation generation performance. The start pose is the green arrow, and the goal pose is the red arrow. Our method more accurately reaches the goal and avoids obstacles while style is controlled by a text prompt.
  • ...and 3 more figures