Diverse 3D Human Pose Generation in Scenes based on Decoupled Structure

Bowen Dang, Xi Zhao

TL;DR

The paper addresses the generation of diverse, semantically controlled 3D human poses in scenes, noting that current methods are hampered by their reliance on limited human-scene interaction datasets. It proposes a decoupled, three-stage pipeline: pose generation, contact generation, and placement. A pose generator trained on AMASS provides a rich pose prior, a contact generator trained on PROX encodes human-scene contact priors, and a placement module integrates the human into the scene via initial placement, feasibility checks, and optimization using SMPL-X. Key contributions include (i) a framework that decouples pose generation from interaction generation, (ii) a physical feasibility test that prunes unrealistic placements, and (iii) an optimization stage that refines poses for natural interactions. Experiments show increased physical plausibility and pose diversity on PROX, and generalization to MP3D-R. The approach reduces reliance on interaction datasets while enabling controllable, diverse, and realistic scene-embedded human poses for AR/VR, gaming, and data generation.

Abstract

This paper presents a novel method for generating diverse 3D human poses in scenes with semantic control. Existing methods rely heavily on human-scene interaction datasets, which limits the diversity of the generated human poses. To overcome this challenge, we propose to decouple the pose and interaction generation processes. Our approach consists of three stages: pose generation, contact generation, and placing the human in the scene. We train a pose generator on a human dataset to learn a rich pose prior, and a contact generator on a human-scene interaction dataset to learn a human-scene contact prior. Finally, the placing module puts the human body into the scene in a suitable and natural manner. Experimental results on the PROX dataset demonstrate that our method produces more physically plausible interactions and more diverse human poses. Experiments on the MP3D-R dataset further validate the generalization ability of our method.
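
The control flow of the three stages can be sketched as follows. This is only an illustrative skeleton, not the authors' implementation: the function name `generate_in_scene` and every callable parameter are hypothetical placeholders for the trained generators and the placement sub-stages, whose internals the abstract does not specify.

```python
from typing import Any, Callable, Iterable, Optional


def generate_in_scene(
    action: Any,
    obj: Any,
    scene: Any,
    pose_generator: Callable[[Any], Any],
    contact_generator: Callable[[Any, Any], Any],
    propose_positions: Callable[[Any, Any, Any], Iterable[Any]],
    is_feasible: Callable[[Any, Any, Any], bool],
    optimize: Callable[[Any, Any, Any, Any], Any],
) -> Optional[Any]:
    """Hypothetical skeleton of a decoupled pose -> contact -> placement pipeline."""
    # Stage 1: the pose is sampled from the action alone, independent of the scene.
    pose = pose_generator(action)
    # Stage 2: contact is predicted for the generated body, conditioned on the object.
    contact = contact_generator(pose, obj)
    # Stage 3: try candidate positions, drop infeasible ones, refine the first survivor.
    for position in propose_positions(scene, pose, contact):
        if not is_feasible(scene, pose, position):
            continue
        return optimize(scene, pose, contact, position)
    return None  # no candidate passed the feasibility test
```

Because the pose generator never sees the scene, it can in principle be trained on a pose-only dataset, which is the point of the decoupling described above.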

Paper Structure (19 sections, 5 equations, 8 figures, 3 tables)


Figures (8)

  • Figure 1: The overview of our method. The inputs are an action-object pair and the scene mesh. The output is the human body mesh placed in the scene. In the first stage, we generate a desired human body model using the pose generator. In the second stage, we generate the contact feature map for the human body mesh using the contact generator. Finally, we put the human body into the scene. The last stage can be further divided into three sub-stages: initial position selection, physical feasibility test, and optimization.
  • Figure 2: The network structure of the pose generator. The input and output are both the body pose. The action code serves as the conditional input to control the generated body pose.
  • Figure 3: The network structure of the contact generator. The input and output are both the contact feature. The simplified body mesh and object code serve as the conditional input to control the generated contact feature.
  • Figure 4: Examples of bad positions. (a) and (b) show the situations with severe penetrations. (c) and (d) show the situations without reasonable contact.
  • Figure 5: Gallery of our results. The first and second rows show results on the PROX and MP3D-R datasets, respectively. The last row shows results for uncommon interactions.
  • ...and 3 more figures
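
The two failure modes named in the Figure 4 caption (severe penetration and missing contact) suggest what a physical feasibility test has to reject. The sketch below is one plausible form of such a test, not the paper's actual criterion: the function `is_feasible`, its inputs (per-vertex signed distances to the scene, negative inside geometry), and both thresholds are assumptions made for illustration.

```python
import numpy as np


def is_feasible(signed_dist, contact_mask,
                max_penetration_ratio=0.05, max_contact_dist=0.05):
    """Hypothetical feasibility check for one candidate placement.

    signed_dist:  per-vertex signed distance to the scene surface
                  (assumed convention: negative means inside scene geometry).
    contact_mask: per-vertex booleans marking vertices expected to touch the scene.
    Both thresholds are illustrative values, not taken from the paper.
    """
    signed_dist = np.asarray(signed_dist, dtype=float)
    contact_mask = np.asarray(contact_mask, dtype=bool)

    # Reject placements where too many body vertices lie inside the scene.
    penetration_ratio = np.mean(signed_dist < 0.0)
    if penetration_ratio > max_penetration_ratio:
        return False

    # If no contact is expected, the penetration check alone decides.
    if not contact_mask.any():
        return True

    # Reject placements where the expected contact vertices float away from the surface.
    mean_contact_dist = np.abs(signed_dist[contact_mask]).mean()
    return bool(mean_contact_dist <= max_contact_dist)
```

With a flat floor as the scene, a body standing on the floor passes, while one sunk into the floor or hovering above it is rejected.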