SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code

Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, Alireza Fathi

TL;DR

SceneCraft tackles open-domain text-to-3D scene synthesis by turning descriptions into Blender-executable Python scripts through a dual-loop LLM agent. The inner loop optimizes the layout of each scene via a relational scene graph and constraint-based coding, while the outer loop distills recurring spatial patterns into a reusable library, enabling continuous self-improvement without LLM fine-tuning. Key contributions include the relational bipartite scene graph, a constraint-based solver with $F_r$ functions, a multimodal feedback-driven reviewer, and a sample-efficient library-learning pipeline. The paper reports quantitative and qualitative gains over BlenderGPT on synthetic data and demonstrates benefits for scene-guided video generation on Sintel. By integrating planning, coding, perception, and self-improvement into one framework, the approach offers scalable, automated tooling for architectural, cinematic, and game design pipelines.

Abstract

This paper introduces SceneCraft, a Large Language Model (LLM) agent that converts text descriptions into Blender-executable Python scripts, which render complex scenes with up to a hundred 3D assets. This process requires complex spatial planning and arrangement. We tackle these challenges through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among assets in the scene. It then writes Python scripts based on this graph, translating relationships into numerical constraints for asset layout. Next, SceneCraft leverages the perceptual strengths of vision-language foundation models like GPT-V to analyze rendered images and iteratively refine the scene. On top of this process, SceneCraft features a library learning mechanism that compiles common script functions into a reusable library, facilitating continuous self-improvement without expensive LLM parameter tuning. Our evaluation demonstrates that SceneCraft surpasses existing LLM-based agents in rendering complex scenes, as shown by its adherence to constraints and favorable human assessments. We also showcase the broader application potential of SceneCraft by reconstructing detailed 3D scenes from the Sintel movie and guiding a video generative model with the generated scenes as an intermediary control signal.
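The step of translating scene-graph relationships into numerical constraints can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the relation names, the scorer functions `proximity_score` and `alignment_score`, the averaging in `layout_score`, and the random search in `search_layout` are hypothetical stand-ins, not the paper's actual functions or optimizer. The scene is reduced to 2D points so that the sketch runs without Blender.

```python
import itertools
import math
import random

# Hypothetical relation scorers: each maps a layout to a score in (0, 1].
def proximity_score(layout, assets, max_dist=2.0):
    """Return 1.0 when all listed assets lie within max_dist of each other."""
    worst = max(math.dist(layout[a], layout[b])
                for a, b in itertools.combinations(assets, 2))
    return 1.0 if worst <= max_dist else max_dist / worst

def alignment_score(layout, assets, axis=1):
    """Return 1.0 when the assets share a coordinate along `axis`."""
    coords = [layout[a][axis] for a in assets]
    return 1.0 / (1.0 + (max(coords) - min(coords)))

SCORERS = {"proximity": proximity_score, "alignment": alignment_score}

def layout_score(layout, relations):
    """Average the per-relation scores over all relation nodes of the graph."""
    total = sum(SCORERS[name](layout, assets) for name, assets in relations)
    return total / len(relations)

def search_layout(assets, relations, iters=2000, extent=10.0, seed=0):
    """Random search (an assumed optimizer): keep the best-scoring sample."""
    rng = random.Random(seed)
    best, best_score = None, -1.0
    for _ in range(iters):
        layout = {a: (rng.uniform(0, extent), rng.uniform(0, extent))
                  for a in assets}
        score = layout_score(layout, relations)
        if score > best_score:
            best, best_score = layout, score
    return best, best_score

# Each relation node links a relation type to the assets it constrains.
relations = [("proximity", ("table", "chair")),
             ("alignment", ("lamp_a", "lamp_b"))]
best, score = search_layout(["table", "chair", "lamp_a", "lamp_b"], relations)
```

In a real pipeline the positions in `best` would be written to the corresponding Blender objects before rendering; that step is omitted here.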

Paper Structure (22 sections, 6 equations, 12 figures, 3 tables, 1 algorithm)

Figures (12)

  • Figure 1: Examples comparing SceneCraft's output against a BlenderGPT baseline for different queries.
  • Figure 2: SceneCraft is composed of a dual-loop self-improving pipeline. In the inner loop, for each scene, an LLM autonomously writes a script to interact with Blender, receives the rendered image, and keeps improving the script until the scene is good. In the outer loop, SceneCraft summarizes common functions over a batch of written scripts to maintain a reusable design skill library.
  • Figure 3: The workflow of SceneCraft's inner-loop improvement of each scene. 1) Given a query, an LLM writes a list of asset descriptions, then uses a CLIP retriever to fetch assets; 2) the LLM decomposes the full query into a sequence of sub-scenes, each associated with a subset of assets and a text description; 3) an LLM-Planner generates a relational graph linking assets to spatial relationships; 4) based on the graph, an LLM-Coder writes Python code to produce a list of numerical constraints, which can be executed to search for an optimal layout and rendered into an image using Blender; 5) an LLM-Reviewer with vision perception capability critiques the rendered image and updates the script accordingly. This critique-and-revise procedure can be repeated multiple times to iteratively improve the script and scene.
  • Figure 4: Example of an update to the function $\texttt{parallelism\_score}$ in the outer-loop library learning phase. The update adds a constraint score that forces the orientations of the assets to be similar.
  • Figure 5: Predicted 3D scenes and generated videos from SceneCraft compared against other baselines.
  • ...and 7 more figures
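
The kind of library update described for Figure 4 can be sketched as follows. The source only states that the update to `parallelism_score` adds a score term on orientation similarity, so the function bodies here are hypothetical: the position term, the yaw-based orientation term, the equal weighting, and the asset dictionaries with `position` and `yaw` keys are all assumptions made for illustration.

```python
import math

def _unit(v):
    """Normalize a 2D vector; zero-length vectors map to (0, 0)."""
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n) if n > 0 else (0.0, 0.0)

def position_parallelism(positions):
    """Mean |cos| between consecutive displacement vectors (1.0 = collinear)."""
    dirs = [_unit((b[0] - a[0], b[1] - a[1]))
            for a, b in zip(positions, positions[1:])]
    if len(dirs) < 2:
        return 1.0
    ref = dirs[0]
    return sum(abs(d[0] * ref[0] + d[1] * ref[1])
               for d in dirs[1:]) / (len(dirs) - 1)

def orientation_similarity(yaws):
    """Mean resultant length of the yaw angles: 1.0 when all are equal."""
    c = sum(math.cos(y) for y in yaws) / len(yaws)
    s = sum(math.sin(y) for y in yaws) / len(yaws)
    return math.hypot(c, s)

def parallelism_score_v1(assets):
    """Assumed earlier version: scores positions only."""
    return position_parallelism([a["position"] for a in assets])

def parallelism_score_v2(assets):
    """Assumed updated version: also rewards similar orientations."""
    pos = position_parallelism([a["position"] for a in assets])
    ori = orientation_similarity([a["yaw"] for a in assets])
    return 0.5 * (pos + ori)

# Three assets in a row; in the second scene they face different directions.
row = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
same_facing = [{"position": p, "yaw": 0.0} for p in row]
mixed_facing = [{"position": p, "yaw": y}
                for p, y in zip(row, [0.0, math.pi / 2, math.pi])]
```

Under these assumptions, the position-only version cannot tell the two scenes apart, while the updated version scores `mixed_facing` lower, which is the behavior the caption attributes to the update.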