Towards Language-guided Interactive 3D Generation: LLMs as Layout Interpreter with Generative Feedback

Yiqi Lin, Hao Wu, Ruichen Wang, Haonan Lu, Xiaodong Lin, Hui Xiong, Lin Wang

TL;DR

This work tackles the challenge of open-vocabulary, language-driven interactive 3D scene generation by introducing LI3D, a system that uses LLMs as a 3D layout interpreter to steer layout-to-3D generators like CompoNeRF. It couples this with a Generative Feedback Module powered by LLaVA to verify rendered content against descriptions and provide detailed feedback to refine layouts across multi-round interactions. The approach supports extension to 2D generation and editing, and is validated through quantitative benchmarks on i-CLEVR and qualitative demonstrations, including object-level edits and multi-object scenes. The results suggest that LLMs can serve as a spatial commonsense reasoning engine, enabling flexible, human-in-the-loop content creation with practical implications for metaverse and gaming applications, while also highlighting current limitations in stability and consistency.

Abstract

Generating and editing a 3D scene guided by natural language poses a challenge, primarily due to the complexity of specifying the positional relations and volumetric changes within the 3D space. Recent advancements in Large Language Models (LLMs) have demonstrated impressive reasoning, conversational, and zero-shot generation abilities across various domains. Surprisingly, these models also show great potential in realizing and interpreting the 3D space. In light of this, we propose a novel language-guided interactive 3D generation system, dubbed LI3D, that integrates LLMs as a 3D layout interpreter into off-the-shelf layout-to-3D generative models, allowing users to flexibly and interactively generate visual content. Specifically, we design a versatile layout structure based on bounding boxes and semantics to prompt the LLMs to model spatial generation and reasoning from language. Our system also incorporates LLaVA, a large language and vision assistant, to provide generative feedback from the visual aspect for improving the visual quality of generated content. We validate the effectiveness of LI3D primarily in 3D generation and editing through multi-round interactions, and show that it can be flexibly extended to 2D generation and editing. Various experiments demonstrate the potential benefits of incorporating LLMs in generative AI for applications such as the metaverse. Moreover, we benchmark the layout reasoning performance of LLMs with neural visual artist tasks, revealing their emergent ability in the spatial layout domain.
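
To make the idea of a layout structure built from bounding boxes and semantics more concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the format used by LI3D: the JSON schema, the field names (`description`, `center`, `size`), and the helpers `parse_layout` and `move_object` are all hypothetical. It shows how a text reply from an LLM could be parsed into validated 3D boxes and how a simple "move this object" edit could be applied between interaction rounds.

```python
import json
from dataclasses import dataclass, replace
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box3D:
    """One scene object: a text description plus an axis-aligned box.

    Field names are assumptions made for this sketch only.
    """
    description: str  # semantics, e.g. "a wooden table"
    center: Vec3      # (x, y, z) position of the box center
    size: Vec3        # (width, height, depth) extents, all positive

    def volume(self) -> float:
        w, h, d = self.size
        return w * h * d


def parse_layout(llm_reply: str) -> List[Box3D]:
    """Parse a JSON reply (hypothetical schema) into validated boxes."""
    boxes = []
    for item in json.loads(llm_reply)["objects"]:
        center = tuple(float(v) for v in item["center"])
        size = tuple(float(v) for v in item["size"])
        if len(center) != 3 or len(size) != 3:
            raise ValueError(f"expected 3D center and size: {item}")
        if any(s <= 0 for s in size):
            raise ValueError(f"box extents must be positive: {item}")
        boxes.append(Box3D(str(item["description"]), center, size))
    return boxes


def move_object(boxes: List[Box3D], index: int, offset: Vec3) -> List[Box3D]:
    """Return a new layout with one box translated; the input is unchanged."""
    old = boxes[index]
    new_center = tuple(c + o for c, o in zip(old.center, offset))
    edited = list(boxes)
    edited[index] = replace(old, center=new_center)
    return edited


if __name__ == "__main__":
    reply = (
        '{"objects": ['
        '{"description": "a wooden table", "center": [0, 0, 0], "size": [1, 0.5, 1]},'
        '{"description": "a vase", "center": [0, 0.5, 0], "size": [0.2, 0.5, 0.2]}'
        ']}'
    )
    layout = parse_layout(reply)
    layout = move_object(layout, 1, (0.3, 0.0, 0.0))
    for box in layout:
        print(box.description, box.center, box.size)
```

Validating the reply before handing it to a generator is a design choice of this sketch: free-form LLM output can be malformed, so rejecting boxes with missing coordinates or non-positive extents keeps bad layouts from reaching a downstream layout-to-3D model.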

Paper Structure (20 sections, 3 equations, 12 figures, 1 table)

This paper contains 20 sections, 3 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: The overall system of LI3D. (a) LI3D uses the LLM to interpret language input into a 3D layout that serves as conditional input for a layout-to-3D generative model (CompoNeRF [lin2023componerf]) (Sec. \ref{sec:li3d}); it can also be extended to the image domain with several adaptations (Sec. \ref{sec:2d}). (b) LLaVA [liu2023visual] can be integrated into LI3D to predict generation quality and provide detailed descriptive feedback, which the LLM uses to update a layout that fails to produce satisfactory content (Sec. \ref{sec:fb}).
  • Figure 2: Multiple rounds of interaction for 3D scene generation between users and LI3D.
  • Figure 3: Multiple rounds of interaction for 3D single object generation between users and LI3D.
  • Figure 4: Ablation study of generative feedback module.
  • Figure 5: Failure cases of LI3D.
  • ...and 7 more figures