Functional 3D Scene Synthesis through Human-Scene Optimization
Yao Wei, Matteo Toso, Pietro Morerio, Michael Ying Yang, Alessio Del Bue
TL;DR
This work tackles the challenge of generating 3D indoor scenes from text by enforcing human usability through a human–scene interaction prior. It introduces a three-stage pipeline (Reasoning, 3D Assembly, and Optimization). The Reasoning stage constructs a commonsense scene graph via graph diffusion and CLIP conditioning; the 3D Assembly stage builds a 3D layout from retrieved meshes and SMPL-X humans; the Optimization stage refines the scene to maximize functional human–object contacts while avoiding interpenetration. The approach leverages LLM-driven action priors and zero-shot capabilities, yielding superior iRecall and SCA scores across bedroom, living room, and dining room scenes on 3D-FRONT, along with strong qualitative improvements over InstructScene and related baselines. This functional, human-centric perspective improves the realism and practical usability of generated environments for AR/VR and design workflows. The framework can extend to additional human actions and domains, although it is limited by diffusion randomness and simplified motion modeling.
Abstract
This paper presents a novel generative approach that outputs 3D indoor environments solely from a textual description of the scene. Current methods often treat scene synthesis as a mere layout prediction task, leading to rooms with overlapping objects or overly structured scenes, with limited consideration of the practical usability of the generated environment. Instead, our approach is based on a simple but effective principle: we condition scene synthesis to generate rooms that are usable by humans. This principle is implemented by synthesizing 3D humans that interact with the objects composing the scene. If this human-centric scene generation is viable, the room layout is functional and has a more coherent 3D structure. To this end, we propose a novel method for functional 3D scene synthesis, which consists of three stages: reasoning, 3D assembly, and optimization. We regard text-guided 3D synthesis as a reasoning process and generate a scene graph via a graph diffusion network. Taking object functional co-occurrence into account, we design a new strategy that better accommodates human-object interaction and avoidance, achieving human-aware 3D scene optimization. We conduct both qualitative and quantitative experiments to validate the effectiveness of our method in generating coherent 3D scenes.
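The idea of optimizing for human-object interaction and avoidance can be illustrated with a toy 2D energy. The sketch below is not the paper's formulation: it assumes objects are axis-aligned boxes, the human is a circle, and a plain finite-difference gradient descent is used. The names `box_distance`, `energy`, `optimize`, `radius`, and `w_avoid` are hypothetical and chosen only for illustration.

```python
import math

def box_distance(px, py, box):
    """Distance from a point to an axis-aligned box (cx, cy, half_w, half_h); 0 inside."""
    cx, cy, hx, hy = box
    dx = max(abs(px - cx) - hx, 0.0)
    dy = max(abs(py - cy) - hy, 0.0)
    return math.hypot(dx, dy)

def energy(px, py, target, obstacles, radius=0.3, w_avoid=10.0):
    # Contact term (assumed form): the human circle should just touch the target.
    contact = (box_distance(px, py, target) - radius) ** 2
    # Avoidance term (assumed form): penalize overlap with every other object.
    avoid = sum(max(radius - box_distance(px, py, b), 0.0) ** 2 for b in obstacles)
    return contact + w_avoid * avoid

def optimize(start, target, obstacles, steps=500, lr=0.05, eps=1e-4):
    """Gradient descent on the toy energy using central finite differences."""
    px, py = start
    for _ in range(steps):
        gx = (energy(px + eps, py, target, obstacles)
              - energy(px - eps, py, target, obstacles)) / (2 * eps)
        gy = (energy(px, py + eps, target, obstacles)
              - energy(px, py - eps, target, obstacles)) / (2 * eps)
        px, py = px - lr * gx, py - lr * gy
    return px, py

# Hypothetical layout: a chair-like target and a table-like obstacle.
target = (0.0, 0.0, 0.5, 0.5)
obstacles = [(2.0, 0.0, 0.5, 0.5)]
start = (2.6, 0.6)  # starts slightly overlapping the obstacle
final = optimize(start, target, obstacles)
print("start energy:", energy(*start, target, obstacles))
print("final energy:", energy(*final, target, obstacles))
```

In this toy setting the human position moves off the obstacle and settles in contact with the target. A full system would operate on 3D meshes and SMPL-X bodies and would also adjust object placements, which this sketch does not attempt.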
