
Chat2Layout: Interactive 3D Furniture Layout with a Multimodal LLM

Can Wang, Hongliang Zhong, Menglei Chai, Mingming He, Dongdong Chen, Jing Liao

TL;DR

Chat2Layout tackles interactive 3D furniture layout with a multimodal LLM agent that communicates through a unified vision-question paradigm. The system uses training-free visual prompting and an Offline-to-Online reference search (O2O-Search) to steer reasoning and planning: it perceives user requirements and the 3D scene, then executes a sequence of 3D actions with iterative feedback. Key contributions include an MLLM-driven agent architecture, a visual-text prompting framework, an O2O-Search strategy that keeps prompts compact, and multi-turn interaction that supports open-set furniture placement, including placement on walls. The results show improved layout quality, realism, and interactivity over baselines, highlighting the approach's potential to enhance open-ended interior-design workflows in 3D environments.

Abstract

Automatic furniture layout has long been desired for convenient interior design. Leveraging the remarkable visual reasoning capabilities of multimodal large language models (MLLMs), recent methods address layout generation in a static manner, lacking the feedback-driven refinement essential for interactive user engagement. We introduce Chat2Layout, a novel interactive furniture layout generation system that extends the functionality of MLLMs into the realm of interactive layout design. To achieve this, we establish a unified vision-question paradigm for in-context learning, enabling seamless communication with MLLMs to steer their behavior without altering model weights. Within this framework, we present a novel training-free visual prompting mechanism. This involves a visual-text prompting technique that assists MLLMs in reasoning about plausible layout plans, followed by an Offline-to-Online search (O2O-Search) method, which automatically identifies the minimal set of informative references to provide exemplars for visual-text prompting. By employing an agent system with MLLMs as the core controller, we enable bidirectional interaction. The agent not only comprehends the 3D environment and user requirements through linguistic and visual perception but also plans tasks and reasons about actions to generate and arrange furniture within the virtual space. Furthermore, the agent iteratively updates based on visual feedback from execution results. Experimental results demonstrate that our approach facilitates language-interactive generation and arrangement for diverse and complex 3D furniture.
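
The perceive, plan, act, and feedback cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `Scene`, `Action`, `run_turn`, and `stub_planner` are hypothetical names, the scene is reduced to a dictionary of 2D positions, and the MLLM is replaced by a plain Python callable so the loop runs without any model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Scene:
    """Toy stand-in for the 3D environment: object name -> (x, y) position."""
    objects: Dict[str, Tuple[float, float]] = field(default_factory=dict)

    def describe(self) -> str:
        # Text observation handed to the planner (the real system also uses images).
        return ", ".join(f"{n}@{p}" for n, p in sorted(self.objects.items())) or "empty room"


@dataclass
class Action:
    op: str                                   # "add" or "remove" (assumed action set)
    name: str
    position: Tuple[float, float] = (0.0, 0.0)


def execute(scene: Scene, action: Action) -> None:
    """Apply one atomic action to the scene."""
    if action.op == "add":
        scene.objects[action.name] = action.position
    elif action.op == "remove":
        scene.objects.pop(action.name, None)
    else:
        raise ValueError(f"unknown op: {action.op}")


Planner = Callable[[str, str], List[Action]]  # (request, observation) -> actions


def run_turn(scene: Scene, request: str, planner: Planner, max_rounds: int = 3) -> List[Action]:
    """One user turn: observe, plan, act, then re-plan from the updated scene."""
    executed: List[Action] = []
    for _ in range(max_rounds):
        observation = scene.describe()        # perception
        actions = planner(request, observation)  # planning (an MLLM call in the paper)
        if not actions:                       # planner is satisfied with the scene
            break
        for action in actions:                # action execution
            execute(scene, action)
            executed.append(action)
    return executed


def stub_planner(request: str, observation: str) -> List[Action]:
    """Hypothetical planner: adds a bed once if the request mentions one."""
    if "bed" in request and "bed@" not in observation:
        return [Action("add", "bed", (1.0, 2.0))]
    return []
```

Calling `run_turn(Scene(), "add a bed", stub_planner)` adds the bed in the first round and stops in the second, once the planner sees the bed in the observation. Passing the planner as a callable keeps the loop independent of any particular model API.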

Paper Structure

This paper contains 26 sections, 2 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Chat2Layout features a multimodal LLM agent designed to facilitate natural language interaction between users and 3D indoor environments. Users can provide a wide range of instructions, from abstract requests (texts in red) to specific commands (texts in blue), whether in isolation (the first chat) or as part of a continuous dialogue (subsequent interactions). The agent interprets these instructions and autonomously executes the corresponding operations within the 3D environment. This enables a multi-turn conversation process, empowering users to provide feedback and engage dynamically with the environment. Chat2Layout supports a variety of applications, including object removal or addition (texts in bold), rotation (texts in green), scaling (texts in pink), and re-arrangement (texts in yellow). Additionally, it enables 3D layout generation that involves walls (texts in purple).
  • Figure 2: The framework of Chat2Layout. Our agent begins by perceiving the user's requirements, along with the indoor scene descriptions and visual captures provided by the observer from the environment. Within the brain module, the agent translates the requirements into a task and decomposes it into a sequence of atomic tasks. It then reasons about the necessary 3D actions for each atomic task, generating a sequence of 3D actions that are subsequently executed in the action module, resulting in modifications to the environment. Users can then observe the updated 3D indoor scene and provide new requirements. Throughout this process, all interactions with the MLLM adhere to the vision-question paradigm for in-context learning, which adopts visual-text prompting to assist the agent in reasoning and decision-making and Offline-to-Online Search to provide the support set as references for prompting.
  • Figure 3: Vision-question paradigm for task decomposition. We decompose a task into a sequence of atomic tasks using the vision-question paradigm.
  • Figure 4: Grid coverage algorithm for floor placement. For a large set of furniture, we place items one local region at a time, and we propose a scale-adaptive grid placement for large furniture items.
  • Figure 5: Visual content for initial orientation prediction. The agent selects the initial orientation of the chair from four views.
  • ...and 10 more figures
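
The Figure 4 caption mentions grid-based floor placement without giving details here. As a rough illustration of the general idea only, the sketch below uses a generic first-fit occupancy grid: the floor is a grid of cells, each item has a rectangular footprint in cells, and larger footprints are placed first. This is an assumed stand-in technique, not the paper's grid coverage or scale-adaptive algorithm, and `first_fit` and `place_all` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple


def first_fit(grid: List[List[bool]], w: int, h: int) -> Optional[Tuple[int, int]]:
    """Find the first free w x h block (row-major scan), mark it occupied,
    and return its top-left (row, col); return None if nothing fits."""
    rows, cols = len(grid), len(grid[0])
    for r in range(rows - h + 1):
        for c in range(cols - w + 1):
            cells = [(r + dr, c + dc) for dr in range(h) for dc in range(w)]
            if all(not grid[rr][cc] for rr, cc in cells):
                for rr, cc in cells:
                    grid[rr][cc] = True
                return (r, c)
    return None


def place_all(rows: int, cols: int,
              items: Dict[str, Tuple[int, int]]) -> Dict[str, Optional[Tuple[int, int]]]:
    """Place items given as name -> (width, height) in cells on an empty floor.
    Assumption: larger footprints go first so they are not blocked by small items."""
    grid = [[False] * cols for _ in range(rows)]
    placements: Dict[str, Optional[Tuple[int, int]]] = {}
    by_area = sorted(items.items(), key=lambda kv: -(kv[1][0] * kv[1][1]))
    for name, (w, h) in by_area:
        placements[name] = first_fit(grid, w, h)
    return placements
```

For example, `place_all(4, 4, {"bed": (3, 2), "desk": (2, 1), "lamp": (1, 1)})` places the bed at `(0, 0)`, the desk at `(2, 0)`, and the lamp at `(0, 3)`. An item that does not fit is mapped to `None` rather than raising, so a caller can decide how to handle it.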