SceneAssistant: A Visual Feedback Agent for Open-Vocabulary 3D Scene Generation

Jun Luo, Jiaxiang Tang, Ruijie Lu, Gang Zeng

Abstract

Text-to-3D scene generation from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework leverages modern 3D object generation models along with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and takes actions accordingly, iteratively refining the scene to achieve more coherent spatial arrangements and better alignment with the input text. Experimental results show that our method can generate diverse, open-vocabulary, and high-quality 3D scenes. Both qualitative analysis and quantitative human evaluations demonstrate the superiority of our approach over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes based on natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant

Paper Structure (26 sections, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Open-vocabulary text-to-3D scene generation via SceneAssistant. Each scene exhibits high fidelity to long-tail objects and intricate spatial constraints described in the text.
  • Figure 2: The SceneAssistant framework and its iterative scene generation process. Our approach utilizes a vision-feedback-driven closed loop (bottom left) where a VLM agent processes multimodal context (rendered views, scene metadata, and system messages) to generate reasoning and actions following the ReAct [yao2022react] paradigm. This framework enables an iterative refinement process (top), progressively building complex scenes from step $1$ to $T$. As highlighted in steps $8$ and $9$ (bottom right), the agent can dynamically respond to system-generated collision warnings, performing corrective manipulations to resolve spatial inconsistencies and achieve a high-quality final 3D layout.
  • Figure 3: Human-agent collaboration for interactive scene editing. SceneAssistant effectively interprets user-provided messages and produces the corresponding scene.
  • Figure 4: Qualitative comparison of indoor scene generation. We compare SceneAssistant with Holodeck [holodeck] and SceneWeaver [sceneweaver] across various indoor categories. SceneAssistant performs precise spatial arrangements and faithfully reconstructs nuanced objects often simplified or omitted by existing baselines.
  • Figure 5: Qualitative results of open-vocabulary scene generation. SceneAssistant consistently produces superior spatial layouts and semantic consistency.
  • ...and 5 more figures
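
The closed loop described in the Figure 2 caption (observe the scene, receive system messages such as collision warnings, act, repeat up to step $T$) can be illustrated with a toy sketch. This is not the authors' implementation: the class names, the `Move` operation, the axis-aligned collision test, and the rule-based `stub_policy` standing in for the VLM are all assumptions made for illustration. Only the operation names Scale, Rotate, and FocusOn come from the abstract.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class SceneObject:
    name: str
    center: tuple           # (x, y, z) position
    size: tuple             # full extents along x, y, z
    yaw_deg: float = 0.0

    def bounds(self):
        """Axis-aligned (min, max) corners; yaw is ignored in this toy model."""
        lo = tuple(c - s / 2 for c, s in zip(self.center, self.size))
        hi = tuple(c + s / 2 for c, s in zip(self.center, self.size))
        return lo, hi


@dataclass
class Scene:
    objects: dict = field(default_factory=dict)
    focus: object = None    # name of the object the camera looks at, if any

    def apply(self, action):
        """Apply one atomic operation given as (op, object_name, argument)."""
        op, name, arg = action
        obj = self.objects[name]
        if op == "Scale":
            obj.size = tuple(s * arg for s in obj.size)
        elif op == "Rotate":
            obj.yaw_deg = (obj.yaw_deg + arg) % 360
        elif op == "Move":      # hypothetical operation, not named in the source
            obj.center = tuple(c + d for c, d in zip(obj.center, arg))
        elif op == "FocusOn":
            self.focus = name
        else:
            raise ValueError(f"unknown operation: {op}")


def collision_warnings(scene):
    """Return pairs of object names whose bounding boxes strictly overlap."""
    warnings = []
    for a, b in combinations(scene.objects.values(), 2):
        (alo, ahi), (blo, bhi) = a.bounds(), b.bounds()
        if all(alo[i] < bhi[i] and blo[i] < ahi[i] for i in range(3)):
            warnings.append((a.name, b.name))
    return warnings


def stub_policy(scene, warnings):
    """Stand-in for the VLM: nudge the second colliding object along +x."""
    if warnings:
        _, second = warnings[0]
        return ("Move", second, (0.5, 0.0, 0.0))
    return None             # nothing left to fix, so stop


def run_agent(scene, policy, max_steps=20):
    """Observe, act, repeat until the policy stops or max_steps is reached."""
    history = []
    for _ in range(max_steps):
        warnings = collision_warnings(scene)    # system-generated feedback
        action = policy(scene, warnings)
        if action is None:
            break
        scene.apply(action)
        history.append(action)
    return history


if __name__ == "__main__":
    scene = Scene()
    scene.objects["table"] = SceneObject("table", (0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
    scene.objects["chair"] = SceneObject("chair", (0.5, 0.0, 0.0), (1.0, 1.0, 1.0))
    print(run_agent(scene, stub_policy))
    print(collision_warnings(scene))
```

In a real system the policy would be a VLM call that receives rendered views and scene metadata in addition to the warnings; the sketch keeps only the control flow of the feedback loop.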