Diverse 3D Human Pose Generation in Scenes based on Decoupled Structure
Bowen Dang, Xi Zhao
TL;DR
The paper addresses the challenge of generating diverse, semantically controlled 3D human poses in scenes, noting that current methods are limited by their reliance on small human–scene interaction datasets. It proposes a decoupled, three-stage pipeline (pose generation, contact generation, and placement) in which a pose generator trained on AMASS provides a rich pose prior, a contact generator trained on PROX encodes human–scene contact priors, and a placement module integrates the SMPL-X human body into the scene via initial placement, feasibility checks, and optimization. Key contributions include (i) a framework that decouples pose generation from interaction generation, (ii) a physical feasibility test that prunes unrealistic placements, and (iii) an optimization stage that refines poses for natural interactions. Experiments show improved physical plausibility and pose diversity on PROX, and generalization to MP3D-R. The approach reduces reliance on interaction datasets while enabling controllable, diverse, and realistic scene-embedded human poses for AR/VR, gaming, and data generation.
Abstract
This paper presents a novel method for generating diverse 3D human poses in scenes with semantic control. Existing methods rely heavily on human-scene interaction datasets, which limits the diversity of the generated human poses. To overcome this limitation, we propose to decouple pose generation from interaction generation. Our approach consists of three stages: pose generation, contact generation, and placing the human into the scene. We train a pose generator on a human pose dataset to learn a rich pose prior, and a contact generator on a human-scene interaction dataset to learn a human-scene contact prior. Finally, a placement module puts the human body into the scene in a suitable and natural manner. Experimental results on the PROX dataset demonstrate that our method produces more physically plausible interactions and more diverse human poses. Experiments on the MP3D-R dataset further validate the generalization ability of our method.
