Human-Aware 3D Scene Generation with Spatially-constrained Diffusion Models

Xiaolin Hong, Hongwei Yi, Fazhi He, Qiong Cao

TL;DR

This work develops an automated pipeline that improves the variety and plausibility of human-object interactions in the existing 3D-FRONT HUMAN dataset, and introduces two spatial collision guidance mechanisms: human-object collision avoidance and object-room boundary constraints.

Abstract

Generating 3D scenes from human motion sequences supports numerous applications, including virtual reality and architectural design. However, previous autoregressive human-aware 3D scene generation methods have struggled to accurately capture the joint distribution of multiple objects and input humans, often generating overlapping objects in the same space. To address this limitation, we explore the potential of diffusion models that simultaneously consider all input humans and the floor plan to generate plausible 3D scenes. Our approach not only satisfies all input human interactions but also adheres to the spatial constraints of the floor plan. Furthermore, we introduce two spatial collision guidance mechanisms: human-object collision avoidance and object-room boundary constraints. These mechanisms help avoid generating scenes that conflict with human motions while respecting layout constraints. To enhance the diversity and accuracy of human-guided scene generation, we develop an automated pipeline that improves the variety and plausibility of human-object interactions in the existing 3D-FRONT HUMAN dataset. Extensive experiments on both synthetic and real-world datasets demonstrate that our framework generates more natural and plausible 3D scenes with precise human-scene interactions, while significantly reducing human-object collisions compared to previous state-of-the-art methods. Our code and data will be made publicly available upon publication of this work.
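
The abstract names two spatial collision guidance mechanisms: human-object collision avoidance and object-room boundary constraints. The paper's exact formulation is not reproduced on this page, but the sketch below shows one plausible way such differentiable penalties could be written, assuming objects and human free-space regions are reduced to axis-aligned 2D boxes (centers and sizes); all function names are illustrative, not the authors' API.

```python
import torch

def box_overlap_area(c1, s1, c2, s2):
    # Differentiable overlap area between axis-aligned 2D boxes,
    # given centers c* and sizes s* with trailing dimension 2.
    lo = torch.maximum(c1 - s1 / 2, c2 - s2 / 2)
    hi = torch.minimum(c1 + s1 / 2, c2 + s2 / 2)
    side = torch.clamp(hi - lo, min=0.0)
    return side[..., 0] * side[..., 1]

def human_object_penalty(obj_c, obj_s, free_c, free_s):
    # Human-object collision avoidance: penalize every object box
    # (N, 2) that intrudes into a free-space box (M, 2) swept by
    # human motion, via the sum of pairwise overlap areas.
    return box_overlap_area(
        obj_c[:, None], obj_s[:, None], free_c[None], free_s[None]
    ).sum()

def room_boundary_penalty(obj_c, obj_s, floor_min, floor_max):
    # Object-room boundary constraint: penalize the extent by which
    # object boxes spill past rectangular floor-plan bounds (2,).
    over = torch.clamp((obj_c + obj_s / 2) - floor_max, min=0.0)
    under = torch.clamp(floor_min - (obj_c - obj_s / 2), min=0.0)
    return (over + under).sum()
```

Because both penalties are differentiable in the box parameters, their gradients can steer a diffusion sampler toward collision-free layouts.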

Paper Structure

This paper contains 11 sections, 7 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Our method generates more plausible 3D scenes given input human motions and floor plans. It excels in two key aspects: (1) avoiding collisions between humans and objects, as well as between objects, a significant improvement over MIME [yi2022mime], and (2) providing better support for human-object interactions compared to DiffuScene [tang2023diffuscene].
  • Figure 2: Overview of our method. SHADE learns a diffusion model to gradually denoise the noisy scene $\mathbf{x}_T$ by simultaneously considering the contact bounding boxes, free-space mask, floor plan, and time step. During inference, SHADE applies three spatial collision guidance functions so that the generated scenes avoid conflicts with human motions and room boundaries and prevent object overlap (a sketch of one guided denoising step follows this list).
  • Figure 3: Comparison between 3D-FRONT HUMAN [yi2022mime] and our calibrated dataset. We correct human-object penetrations through translation modification to improve spatial accuracy. Additionally, we apply category and orientation augmentation to enhance the diversity of interactions.
  • Figure 4: Qualitative comparison on the test split of the calibrated 3D-FRONT HUMAN dataset. Compared with the existing state-of-the-art methods MIME and DiffuScene, our method generates more plausible scenes that avoid conflicts with free-space humans and room boundaries, and contain fewer overlapping objects. Each row represents an example input.
  • Figure 5: Ablation on spatial collision guidance functions. The left column shows scenes generated without guidance, with red boxes indicating constraint violations. The right column shows scenes with guidance, where green boxes highlight the improvements.
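
Figure 2's caption states that the guidance functions are applied at inference time. The sketch below shows one guided denoising step in the spirit of classifier guidance, assuming (the captions do not specify this) that the model's clean-scene prediction is corrected by the gradient of the summed penalties; `model`, `cond`, and `scale` are hypothetical placeholders, not the paper's interface.

```python
import torch

def guided_denoise_step(model, x_t, t, cond, guidance_fns, scale=1.0):
    # One reverse-diffusion step with spatial collision guidance:
    # nudge the predicted clean scene down the gradient of the
    # summed penalty functions before the sampler's next step.
    x_pred = model(x_t, t, cond)              # predicted clean scene layout
    x = x_pred.detach().requires_grad_(True)
    loss = sum(fn(x) for fn in guidance_fns)  # e.g. the penalties above
    grad, = torch.autograd.grad(loss, x)
    return x_pred - scale * grad              # guided prediction
```

Each element of `guidance_fns` would be a closure binding the scene context, e.g. `lambda x: room_boundary_penalty(x[..., :2], x[..., 2:4], floor_min, floor_max)` for an assumed layout where each object row stores its 2D center and size.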