Functional 3D Scene Synthesis through Human-Scene Optimization

Yao Wei, Matteo Toso, Pietro Morerio, Michael Ying Yang, Alessio Del Bue

TL;DR

This work tackles the challenge of generating 3D indoor scenes from text by enforcing human usability through a human–scene interaction prior. It introduces a three-stage pipeline—Reasoning, 3D Assembly, and Optimization—that constructs a commonsense scene graph via graph diffusion and CLIP conditioning, assembles a 3D layout with retrieved meshes and SMPL-X humans, and refines the scene to maximize functional human–object contacts while avoiding interpenetration. The approach leverages LLM-driven action priors and zero-shot capabilities, yielding superior iRecall and SCA metrics across bedroom, living room, and dining room scenes on 3D-FRONT, with strong qualitative improvements over InstructScene and related baselines. This functional, human-centric perspective improves realism and practical usability of generated environments for AR/VR and design workflows, and the framework can extend to additional human actions and domains, albeit with limitations due to diffusion randomness and simplified motion modeling.

Abstract

This paper presents a novel generative approach that outputs 3D indoor environments solely from a textual description of the scene. Current methods often treat scene synthesis as a mere layout prediction task, leading to rooms with overlapping objects or overly structured scenes, with limited consideration of the practical usability of the generated environment. Instead, our approach is based on a simple but effective principle: we condition scene synthesis to generate rooms that are usable by humans. This principle is implemented by synthesizing 3D humans that interact with the objects composing the scene. If this human-centric scene generation is viable, the room layout is functional and leads to a more coherent 3D structure. To this end, we propose a novel method for functional 3D scene synthesis, which consists of reasoning, 3D assembly, and optimization. We regard text-guided 3D synthesis as a reasoning process and generate a scene graph via a graph diffusion network. Considering object functional co-occurrence, we design a new strategy that better accommodates human-object interaction and avoidance, achieving human-aware 3D scene optimization. We conduct both qualitative and quantitative experiments to validate the effectiveness of our method in generating coherent 3D scenes.

Paper Structure

This paper contains 17 sections, 2 equations, 13 figures, 3 tables, 1 algorithm.

Figures (13)

  • Figure 1: The three stages of our scene synthesis pipeline. Given a text prompt $X$ and a scene type $V$, we a) extract a set of object categories $c$, features $f$, possible human-object actions $a$, and pairwise spatial relationships $R$. These provide, respectively, the nodes, embeddings, and directional edges of a noisy, partial graph that can be denoised by a graph diffusion network. b) A second diffusion process then generates the spatial layout of the objects, assigning to each object a translation $t$, size $s$, and orientation $r$ while leaving the previously defined features unaltered; human actions and object categories are used to retrieve 3D meshes and place them at poses $(r, s, t)$. c) Iterating over all predicted human-object actions, we optimize the pose of any object that would result in an intersection between the human mesh and the scene's objects. During this stage all other features are left unaltered.
  • Figure 2: From text prompt to scene graph. Given the scene type $V$ and a text prompt $X$, we predict likely object categories $c$ by a) leveraging the learned distribution of objects in a collection of same-type 3D scenes and b) extracting from $X$ the object categories and their spatial relationships $R$. Then, we c) use Llama to predict possible human contact interactions $a$ with the objects. d) We extract additional features $f$ from the text prompt using a pre-trained VQ-VAE; together, $c$, $f$ and $a$ provide node embeddings for an incomplete scene graph $G$, with edges $R$. e) We then use a graph diffusion network, conditioned on CLIP textual features extracted from $X$, to generate the complete scene graph $G$.
  • Figure 3: Qualitative results. Our method is able to recover from inaccurate layouts (e.g. a nightstand tightly placed between the bed and the wardrobe in the first example, or overlapping chairs in the second one), which are instead present in InstructScene (Lin and Mu, 2024), our baseline. As a reference, we show the ground-truth arrangement.
  • Figure 4: Ablations showing the effect of the reasoning stage. By sampling graph edges conditioned on graph nodes, the reasoning stage produces more structured layouts. To show the details, the content highlighted in the blue box is zoomed in.
  • Figure 5: Ablations showing the effect of the optimization stage. The proposed strategy leads to better human-scene interaction and avoidance. To show the details, the content highlighted in the blue box is zoomed in.
  • ...and 8 more figures
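
The avoidance part of the optimization stage described in the Figure 1 caption (step c) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper works with retrieved 3D meshes and SMPL-X humans, whereas the sketch below assumes axis-aligned 2D footprints on the floor plane and a simple "push out along the axis of least penetration" rule. The names `Box`, `penetration`, `push_out` and `optimize_scene` are hypothetical, and the scene values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned floor footprint (assumed stand-in for a 3D mesh)."""
    name: str
    cx: float  # center along x
    cy: float  # center along y
    w: float   # extent along x
    d: float   # extent along y


def penetration(a: Box, b: Box) -> tuple[float, float]:
    """Overlap depth along x and y; the boxes intersect only if both are > 0."""
    ox = (a.w + b.w) / 2 - abs(a.cx - b.cx)
    oy = (a.d + b.d) / 2 - abs(a.cy - b.cy)
    return ox, oy


def push_out(obj: Box, human: Box) -> bool:
    """Translate obj out of the human along the axis of least penetration.

    Returns True if obj was moved, False if there was no intersection.
    """
    ox, oy = penetration(obj, human)
    if ox <= 0 or oy <= 0:
        return False
    if ox <= oy:
        obj.cx += ox if obj.cx >= human.cx else -ox
    else:
        obj.cy += oy if obj.cy >= human.cy else -oy
    return True


def optimize_scene(objects: list[Box], humans: list[tuple[Box, str]]) -> list[str]:
    """Iterate over (human footprint, interacted object name) pairs.

    Objects other than the interaction target are moved if they intersect
    the human; sizes and all other attributes are left unaltered.
    """
    moved = []
    for human, target in humans:
        for obj in objects:
            if obj.name == target:
                continue  # contact with the interaction target is intended
            if push_out(obj, human):
                moved.append(obj.name)
    return moved


# Illustrative scene: a human standing next to the bed collides with a nightstand.
bed = Box("bed", 0.0, 0.0, 2.0, 2.0)
nightstand = Box("nightstand", 1.5, 0.0, 0.5, 0.5)
human = Box("human_at_bed", 1.3, 0.0, 0.6, 0.6)
moved = optimize_scene([bed, nightstand], [(human, "bed")])
print(moved, round(nightstand.cx, 3))
```

The sketch covers only the avoidance term. It does not model the contact objective that the method also optimizes, and it does not check whether a displaced object now collides with another object or a wall, which a fuller treatment would need to handle.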