Hierarchically-Structured Open-Vocabulary Indoor Scene Synthesis with Pre-trained Large Language Model

Weilin Sun, Xinran Li, Manyi Li, Kai Xu, Xiangxu Meng, Lei Meng

TL;DR

This paper trains a hierarchy-aware network to infer the fine-grained relative positions between objects and designs a divide-and-conquer optimization to solve for scene layouts. The resulting layouts are more reasonable and better aligned with the user requirements and LLM descriptions.

Abstract

Indoor scene synthesis aims to automatically produce plausible, realistic and diverse 3D indoor scenes, especially given arbitrary user requirements. Recently, the promising generalization ability of pre-trained large language models (LLMs) has assisted open-vocabulary indoor scene synthesis. However, the challenge lies in converting the LLM-generated outputs into reasonable and physically feasible scene layouts. In this paper, we propose to generate hierarchically structured scene descriptions with an LLM and then compute the scene layouts. Specifically, we train a hierarchy-aware network to infer the fine-grained relative positions between objects and design a divide-and-conquer optimization to solve for scene layouts. The advantages of using a hierarchically structured scene representation are two-fold. First, the hierarchical structure provides a rough grounding for object arrangement, which alleviates the contradictory placements that arise with dense relations and enhances the generalization ability of the network to infer fine-grained placements. Second, it naturally supports the divide-and-conquer optimization, which first arranges the sub-scenes and then the entire scene, to more effectively solve for a feasible layout. We conduct extensive comparison experiments and ablation studies with both qualitative and quantitative evaluations to validate the effectiveness of our key designs with the hierarchically structured scene representation. Our approach generates more reasonable scene layouts that are better aligned with the user requirements and LLM descriptions. We also present open-vocabulary scene synthesis and interactive scene design results to show the strength of our approach in applications.
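
The divide-and-conquer idea described in the abstract (arrange each sub-scene first, then the entire scene) can be illustrated with a toy sketch. This is not the paper's optimization: the `Box` type, the row-placement rule in `arrange_area`, and the shelf-packing rule in `arrange_scene` are hypothetical stand-ins chosen to keep the example short, and the learned relative positions are not modeled at all.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned footprint of an object (hypothetical type for this sketch)."""
    name: str
    w: float        # extent along x
    d: float        # extent along y
    x: float = 0.0  # lower-left corner, x
    y: float = 0.0  # lower-left corner, y


def arrange_area(objects, gap=0.1):
    """Divide step: lay out one sub-scene in its own local frame.

    Toy rule (an assumption): place the objects left to right in a single row.
    Returns the (width, depth) of the sub-scene's bounding box.
    """
    cursor = 0.0
    for obj in objects:
        obj.x, obj.y = cursor, 0.0
        cursor += obj.w + gap
    width = cursor - gap if objects else 0.0
    depth = max((o.d for o in objects), default=0.0)
    return width, depth


def arrange_scene(areas, room_w, room_d, gap=0.3):
    """Conquer step: pack sub-scene bounding boxes into the room.

    Toy rule (an assumption): simple shelf packing, row by row. Each object
    is then shifted from its local frame into room coordinates.
    """
    x = y = row_depth = 0.0
    for name, objects in areas.items():
        w, d = arrange_area(objects)
        if x + w > room_w:  # current row is full, start a new one
            x, y = 0.0, y + row_depth + gap
            row_depth = 0.0
        if w > room_w or y + d > room_d:
            raise ValueError(f"area {name!r} does not fit in the room")
        for obj in objects:
            obj.x += x
            obj.y += y
        x += w + gap
        row_depth = max(row_depth, d)
    return areas
```

Because each sub-scene is solved in isolation before being placed, a conflict inside one sub-scene cannot disturb the objects of another, which is the intuition behind solving the smaller problems first.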

Paper Structure

This paper contains 13 sections, 5 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Compared to LLM-generated (a) numerical layouts and (b) scene graphs with dense relations, we use (c) LLM-generated hierarchical scene descriptions, whose internal nodes represent functional areas with compact and generalizable priors, to generate more reasonable and physically feasible scene layouts aligned with the descriptions.
  • Figure 2: Our three-level hierarchical scene structure with functional areas as internal nodes.
  • Figure 3: Our hierarchically-structured scene synthesis pipeline involves three stages. First, given a user requirement, we prompt the LLM to generate a hierarchical structure with text descriptions. Second, we train a hierarchy-aware network to infer the fine-grained relative placements between objects. Third, we use a divide-and-conquer optimization algorithm to arrange each functional area separately and then organize them into the entire scene. $z_e$ denotes a random sample drawn from the Gaussian distribution $Z$, and $\hat{r}_e$ denotes the relative position information after decoding.
  • Figure 4: The scenes generated by different approaches. The end-to-end data-driven approaches often produce infeasible object placements, including overlap and out-of-boundary cases (red boxes). HOLODECK, on the other hand, sometimes produces unreasonable results such as two nightstands on the same side of the bed (top row) or a TV stand on the left side of the sofa (second row). By contrast, our approach more effectively produces reasonable and physically feasible scene layouts.
  • Figure 5: Top-view visualizations of the ablation study.
  • ...and 2 more figures
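
The three-level structure named in the Figure 2 caption (scene, functional areas as internal nodes, objects as leaves) could be held in a small tree like the one sketched below. The JSON schema, the field names, and the rule that relations stay inside one functional area are all assumptions made for illustration; the paper's actual LLM output format is not shown in this excerpt.

```python
import json
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """Leaf node: a single object with its text description."""
    name: str
    description: str = ""


@dataclass
class FunctionalArea:
    """Internal node: a group of objects that serve one function."""
    name: str
    objects: list = field(default_factory=list)
    # (subject, relation, target) triples between objects of this area
    relations: list = field(default_factory=list)


@dataclass
class Scene:
    """Root node: the whole room."""
    room_type: str
    areas: list = field(default_factory=list)


# Hypothetical example of a hierarchical description; not taken from the paper.
EXAMPLE_JSON = """
{
  "room_type": "bedroom",
  "areas": [
    {"name": "sleeping area",
     "objects": [{"name": "bed", "description": "a double bed"},
                 {"name": "nightstand"}],
     "relations": [["nightstand", "left of", "bed"]]},
    {"name": "working area",
     "objects": [{"name": "desk"}, {"name": "chair"}],
     "relations": [["chair", "in front of", "desk"]]}
  ]
}
"""


def parse_scene(text):
    """Parse the assumed JSON schema into the three-level tree."""
    raw = json.loads(text)
    areas = []
    for a in raw["areas"]:
        objs = [SceneObject(o["name"], o.get("description", "")) for o in a["objects"]]
        names = {o.name for o in objs}
        rels = [tuple(r) for r in a.get("relations", [])]
        for subj, _, target in rels:
            # Assumption of this sketch: relations never cross area boundaries.
            if subj not in names or target not in names:
                raise ValueError(f"relation {subj!r} -> {target!r} leaves area {a['name']!r}")
        areas.append(FunctionalArea(a["name"], objs, rels))
    return Scene(raw["room_type"], areas)
```

Calling `parse_scene(EXAMPLE_JSON)` yields a `Scene` with two functional areas of two objects each. Keeping relations local to an area keeps the relation set sparse, in contrast to the dense scene graphs mentioned in the Figure 1 caption.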