Layout Generation Agents with Large Language Models

Yuichi Sasazawa, Yasuhiro Sogawa

TL;DR

This work addresses the efficiency gap in creating customizable 3D virtual spaces by introducing an agent-driven layout generation system guided by the multimodal GPT-4V model. The method orchestrates sequential object placement via agents, with 3D objects generated on demand by Shap-E and controlled through JSON-formatted actions (move_cursor, place_object, finish_action). An ablation study identifies key contributors to performance, notably the importance of bounding box text and absolute positioning, while past action history and chain-of-thought provide additional gains. The approach enables generic, domain-independent layout generation by incorporating spatial state during generation, offering practical impact for rapid creation of diverse virtual environments.
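The three JSON-formatted actions named above could be parsed and checked before being applied to the scene. The following is a minimal sketch, not the authors' implementation: only the action names (move_cursor, place_object, finish_action) come from the source, while the field names "action", "position", and "object" and the helper parse_action are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional, Tuple

# Action names taken from the summary; everything else here is assumed.
ACTION_NAMES = {"move_cursor", "place_object", "finish_action"}


@dataclass
class Action:
    name: str
    position: Optional[Tuple[float, float]] = None  # hypothetical field
    description: Optional[str] = None               # hypothetical field


def parse_action(raw: str) -> Action:
    """Parse one JSON action string and reject anything malformed."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("action must be a JSON object")
    name = data.get("action")
    if name not in ACTION_NAMES:
        raise ValueError(f"unknown action: {name!r}")
    if name == "move_cursor":
        # Absolute grid coordinates are assumed for the cursor target.
        x, y = data["position"]
        return Action(name, position=(float(x), float(y)))
    if name == "place_object":
        # Free-text description that a 3D generator could consume.
        return Action(name, description=str(data["object"]))
    return Action(name)
```

Validating the model output at this boundary means a malformed response surfaces as an explicit error instead of silently corrupting the layout state.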

Abstract

In recent years, there has been an increasing demand for customizable 3D virtual spaces. Because creating these spaces requires significant human effort, more efficient ways of creating them are needed. While existing studies have proposed methods for automatically generating layouts such as floor plans and furniture arrangements, these methods only generate text indicating the layout structure based on user instructions, without utilizing the information obtained during the generation process. In this study, we propose an agent-driven layout generation system using the GPT-4V multimodal large language model and validate its effectiveness. Specifically, the language model manipulates agents to sequentially place objects in the virtual space, thus generating layouts that reflect user instructions. Experimental results confirm that our proposed method can generate virtual spaces reflecting user instructions with a high success rate. Additionally, through an ablation study, we identified the elements that contribute to improved action generation performance.

Paper Structure (12 sections, 4 figures, 2 tables)

Figures (4)

  • Figure 1: The overview of the task addressed in this study. When the user inputs the instruction text, the proposed method automatically places the object in the appropriate position in sequence, thereby generating a layout that reflects the user's instruction. The gray lines in the image are a grid that conveys coordinate information to the model.
  • Figure 2: The overview of the algorithm of the proposed method. At each step, necessary information is given to the action generation model (GPT-4V), and the model generates JSON text indicating the next action to be taken by the agent. This process is repeated until the "finish_action" instruction is output, i.e., until the action generation model determines that it has generated a layout that satisfies the user's instruction. Objects to be placed in the virtual space are automatically generated using the 3D object generation model (Shap-E).
  • Figure 3: Information to be given to the action generation model
  • Figure 4: Sample of generated results. Case 1 shows an example of placing a tree near a house on a meadow, and Case 2 shows an example of placing furniture in a room (the brown frame is a wall).
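
The loop described in the Figure 2 caption (query the action generation model, apply the returned action, repeat until "finish_action") can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: next_action stands in for the GPT-4V call, generate_object stands in for Shap-E, and the state layout, the JSON field names, the max_steps cap, and placing objects at the current cursor position are all assumptions.

```python
from typing import Any, Callable, Dict, List


def run_layout_agent(
    instruction: str,
    next_action: Callable[[Dict[str, Any]], Dict[str, Any]],  # stand-in for the model
    generate_object: Callable[[str], Any],                    # stand-in for the 3D generator
    max_steps: int = 50,                                      # assumed safety cap
) -> List[Dict[str, Any]]:
    """Repeatedly request an action and apply it until finish_action."""
    cursor = (0.0, 0.0)
    placed: List[Dict[str, Any]] = []
    history: List[Dict[str, Any]] = []
    for _ in range(max_steps):
        # Snapshot of the information passed to the model at this step.
        state = {
            "instruction": instruction,
            "cursor": cursor,
            "placed": list(placed),
            "history": list(history),
        }
        action = next_action(state)
        history.append(action)
        kind = action["action"]
        if kind == "finish_action":
            break
        if kind == "move_cursor":
            cursor = tuple(action["position"])
        elif kind == "place_object":
            # Assumption: the object is placed at the current cursor position.
            asset = generate_object(action["object"])
            placed.append(
                {"object": action["object"], "asset": asset, "position": cursor}
            )
        else:
            raise ValueError(f"unknown action: {kind!r}")
    return placed


# Usage with a scripted policy in place of a real model call.
script = iter([
    {"action": "move_cursor", "position": [2.0, 3.0]},
    {"action": "place_object", "object": "house"},
    {"action": "move_cursor", "position": [4.0, 3.0]},
    {"action": "place_object", "object": "tree"},
    {"action": "finish_action"},
])
layout = run_layout_agent(
    "a tree near a house",
    lambda state: next(script),
    lambda description: f"mesh:{description}",
)
```

Passing the model and the generator in as callables keeps the loop testable without network access; a real system would also render the scene and include the image in the state at each step.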