Global-Local Tree Search in VLMs for 3D Indoor Scene Generation

Wei Deng, Mengshi Qi, Huadong Ma

TL;DR

This work tackles 3D indoor scene generation from natural language by leveraging Vision-Language Models (VLMs) and addressing the limitations of chain-like prompts with a global-local tree search. A hierarchical scene representation (room, region, floor object, supported object) serves as a proxy between text and 3D layouts, enabling region-wise planning and backtracking when conflicts arise. The method discretizes the top-down view with an emoji grid and prompts the VLM to reason about object placements, integrating a Tree-of-Thoughts-inspired search (DFS) with a global object-level strategy and a local per-object sub-task solver. Experiments show superior plausibility and realism compared to HoloDeck and AnyHome across multiple room types, with strong CLIP-based and human reciprocal-rank evidence, and a code release at the project page.
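The four-level hierarchy can be pictured as a nested data structure. The following is a minimal sketch, not the authors' implementation: the class names, the `planning_subproblems` helper, and the example bedroom are all hypothetical and only illustrate how the decomposition turns one long placement sequence into several short, independent ones.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SupportedObject:
    name: str


@dataclass
class FloorObject:
    name: str
    supported: List[SupportedObject] = field(default_factory=list)


@dataclass
class Region:
    name: str
    floor_objects: List[FloorObject] = field(default_factory=list)


@dataclass
class Room:
    name: str
    regions: List[Region] = field(default_factory=list)


def planning_subproblems(room: Room) -> List[Tuple[str, List[str]]]:
    """List the independent placement problems implied by the hierarchy.

    One problem per region (its floor objects) and one per floor object
    that carries supported objects.
    """
    problems = []
    for region in room.regions:
        problems.append((region.name, [o.name for o in region.floor_objects]))
        for obj in region.floor_objects:
            if obj.supported:
                problems.append((obj.name, [s.name for s in obj.supported]))
    return problems


# Hypothetical example scene, for illustration only.
bedroom = Room("bedroom", [
    Region("sleeping area", [
        FloorObject("bed", [SupportedObject("pillow")]),
        FloorObject("nightstand", [SupportedObject("lamp")]),
    ]),
    Region("study area", [
        FloorObject("desk", [SupportedObject("laptop"), SupportedObject("book")]),
        FloorObject("chair"),
    ]),
])

for parent, children in planning_subproblems(bedroom):
    print(parent, "->", children)
```

In this toy scene, eight objects split into five sub-problems of at most two objects each, so any single search tree stays shallow.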

Abstract

Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable success across various fields. However, there are few studies on 3D indoor scene generation with VLMs. This paper treats the task as a planning problem subject to spatial and layout common-sense constraints. To solve the problem with a VLM, we propose a new global-local tree search algorithm. Globally, the method places objects sequentially and explores multiple candidate placements for each object, so that the problem space is represented as a tree. To reduce the depth of the tree, we decompose the scene structure hierarchically into the room level, region level, floor object level, and supported object level. The algorithm independently generates the floor objects in different regions and the supported objects placed on different floor objects. Locally, we also decompose each sub-task, the placement of a single object, into multiple steps. The algorithm then searches the tree of the problem space. To let the VLM produce object positions, we discretize the top-down view as a dense grid and fill each cell with a different emoji to make the cells distinct. We prompt the VLM with the emoji grid, and the VLM produces a reasonable location for the object by describing the position with the names of emojis. Quantitative and qualitative experimental results show that our approach generates more plausible 3D scenes than state-of-the-art approaches. Our source code is available at https://github.com/dw-dengwei/TreeSearchGen .
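The global search described in the abstract can be sketched as a depth-first search with backtracking over sequential placements. This is a minimal illustration under stated assumptions, not the released code: `propose_origins` is a hypothetical stand-in for the VLM's thought generator (it simply prefers positions near the center of the room), objects are axis-aligned rectangles on a small grid, and a conflict means overlapping cells.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Cell = Tuple[int, int]
Size = Tuple[int, int]  # (height, width) in grid cells


def footprint(origin: Cell, size: Size) -> Set[Cell]:
    """Cells covered by an object whose top-left corner is at `origin`."""
    r0, c0 = origin
    h, w = size
    return {(r, c) for r in range(r0, r0 + h) for c in range(c0, c0 + w)}


def propose_origins(size: Size, grid_shape: Tuple[int, int]) -> List[Cell]:
    """Hypothetical stand-in for the VLM: in-bounds origins, center-most first."""
    rows, cols = grid_shape
    h, w = size
    origins = [(r, c) for r in range(rows - h + 1) for c in range(cols - w + 1)]

    def off_center(origin: Cell):
        r, c = origin
        dist = (r + h / 2 - rows / 2) ** 2 + (c + w / 2 - cols / 2) ** 2
        return (dist, r, c)

    return sorted(origins, key=off_center)


def dfs_place(
    objects: List[Tuple[str, Size]],
    grid_shape: Tuple[int, int],
    occupied: FrozenSet[Cell] = frozenset(),
    layout: Optional[Dict[str, Cell]] = None,
) -> Optional[Dict[str, Cell]]:
    """Place objects one by one; backtrack when a later object cannot fit."""
    layout = {} if layout is None else layout
    if len(layout) == len(objects):
        return layout  # every object has a position
    name, size = objects[len(layout)]
    for origin in propose_origins(size, grid_shape):
        cells = footprint(origin, size)
        if cells & occupied:
            continue  # this candidate overlaps an earlier object
        result = dfs_place(objects, grid_shape, occupied | cells,
                           {**layout, name: origin})
        if result is not None:
            return result
    return None  # no candidate worked: the caller tries its next candidate


objects = [("bed", (2, 2)), ("wardrobe", (2, 2))]
print(dfs_place(objects, grid_shape=(2, 4)))
```

In this example the first proposal puts the bed in the middle of the 2x4 room, which leaves no space for the wardrobe. A chain-like procedure would be stuck with that choice, whereas the search returns to the bed and moves it to the corner.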

Paper Structure

This paper contains 15 sections, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Illustration of chain-like (left) and tree-like (right) reasoning in VLMs on 3D scene generation. Each node represents a token or language sequence. The red dashed nodes indicate that the VLM produced an inappropriate output. The chain-like method cannot correct prior errors, and the subsequent process reasons based on those errors, leading to an unrealistic layout, such as objects extending beyond the room. In contrast, the tree-like method can revise the output if a mistake occurs, resulting in a more realistic layout.
  • Figure 2: We prompt a VLM to generate the hierarchical scene representation level by level. From left to right, we decompose the scene into room, region, floor object, and supported object levels. The final representation is shown on the far right of the figure.
  • Figure 3: To generate a layout for a scene with a large number of objects, we independently generate the layout for each region. The global and local tree search method starts from the root node and goes deeper by generating a thought. If the thought generator fails to produce a thought, the search backtracks to the parent node and moves to another thought.
  • Figure 4: We discretize the top-down view as a grid and fill the cells with emojis. The brick and white go emojis stand for the wall and region boundary, respectively.
  • Figure 5: Performance comparison in terms of CLIP score between our proposed model and state-of-the-art methods.
  • ...and 3 more figures
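
The emoji grid of Figure 4 can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' code: the function names are hypothetical, the cell emojis are drawn from a contiguous run of animal code points starting at U+1F400 (an arbitrary choice), and only the brick wall border is modeled. Each interior cell receives a different emoji, and the Unicode name of an emoji is mapped back to its grid coordinates so that a textual answer from the VLM can be decoded.

```python
import unicodedata
from typing import Dict, List, Optional, Tuple

WALL = "\U0001F9F1"          # brick emoji marks wall cells
FIRST_CELL_EMOJI = 0x1F400   # assumption: start of a run of animal emojis
MAX_CELLS = 62               # stay inside that contiguous run


def build_emoji_grid(rows: int, cols: int) -> Tuple[List[List[str]], Dict[str, Tuple[int, int]]]:
    """Return a wall-bordered grid and a map from emoji name to (row, col)."""
    if rows * cols > MAX_CELLS:
        raise ValueError("grid too large for this small emoji pool")
    grid = [[WALL] * (cols + 2)]
    name_to_cell: Dict[str, Tuple[int, int]] = {}
    code = FIRST_CELL_EMOJI
    for r in range(rows):
        row = [WALL]
        for c in range(cols):
            emoji = chr(code)
            row.append(emoji)
            name_to_cell[unicodedata.name(emoji)] = (r, c)
            code += 1
        row.append(WALL)
        grid.append(row)
    grid.append([WALL] * (cols + 2))
    return grid, name_to_cell


def decode_answer(answer: str, name_to_cell: Dict[str, Tuple[int, int]]) -> Optional[Tuple[int, int]]:
    """Map an emoji name from a (hypothetical) VLM reply to grid coordinates."""
    return name_to_cell.get(answer.strip().upper())


grid, name_to_cell = build_emoji_grid(rows=3, cols=4)
print("\n".join("".join(row) for row in grid))  # text that would go into the prompt
```

Because every interior cell has a unique name, a reply that names one emoji identifies exactly one cell, and an unknown name decodes to `None` so the caller can reject the answer.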