3Dify: a Framework for Procedural 3D-CG Generation Assisted by LLMs Using MCP and RAG

Shun-ichiro Hayashi, Daichi Mukunoki, Tetsuya Hoshino, Satoshi Ohshima, Takahiro Katagiri

TL;DR

3Dify is a framework for procedural 3D-CG generation driven by natural language. It integrates MCP and RAG to automate DCC tools, supports interactive image-based feedback, and can run with local LLMs. The architecture uses three specialized LLM agents (Visualizer, Planner, Manager) and a multi-turn Chatflow with RAG. DCC tools are operated through MCP where available and through CUA otherwise, with a manual fallback, so that generation normally proceeds end to end without the user operating the tools by hand. Key contributions include a scalable, tool-agnostic workflow, an image-selection refinement loop through which the LLM learns generation patterns, and the ability to run locally or with custom models, which reduces API costs and data exposure. The demonstrated pipeline still shows some spatial-coherence problems, but it indicates that the approach is practically feasible and extensible across DCC tools. The open-source release is intended to enable broader adoption and adaptation to related tasks beyond 3D-CG.
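The choice between the MCP pathway, the CUA pathway, and the manual fallback can be pictured as a simple priority rule. The sketch below is only an illustration under stated assumptions: the `DccTool` record, its two capability flags, and the `choose_pathway` function are hypothetical names introduced here, not part of 3Dify's published interface.

```python
from dataclasses import dataclass
from enum import Enum


class Pathway(Enum):
    MCP = "mcp"        # tool is driven through an MCP server
    CUA = "cua"        # tool is driven by automating its GUI
    MANUAL = "manual"  # user operates the tool by hand


@dataclass(frozen=True)
class DccTool:
    # Hypothetical capability flags, assumed for this sketch only.
    name: str
    supports_mcp: bool
    gui_automatable: bool


def choose_pathway(tool: DccTool) -> Pathway:
    """Prefer MCP, fall back to CUA-driven GUI automation, then to manual operation."""
    if tool.supports_mcp:
        return Pathway.MCP
    if tool.gui_automatable:
        return Pathway.CUA
    return Pathway.MANUAL
```

Keeping the decision in one function means a newly supported DCC tool only needs its capability flags declared; the rest of the workflow does not have to change.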

Abstract

This paper proposes "3Dify," a procedural 3D computer graphics (3D-CG) generation framework utilizing Large Language Models (LLMs). The framework enables users to generate 3D-CG content solely through natural language instructions. 3Dify is built upon Dify, an open-source platform for AI application development, and incorporates several state-of-the-art LLM-related technologies such as the Model Context Protocol (MCP) and Retrieval-Augmented Generation (RAG). For 3D-CG generation support, 3Dify automates the operation of various Digital Content Creation (DCC) tools via MCP. When DCC tools do not support MCP-based interaction, the framework employs the Computer-Using Agent (CUA) method to automate Graphical User Interface (GUI) operations. Moreover, to enhance image generation quality, 3Dify allows users to provide feedback by selecting preferred images from multiple candidates. The LLM then learns variable patterns from these selections and applies them to subsequent generations. Furthermore, 3Dify supports the integration of locally deployed LLMs, enabling users to utilize custom-developed models and to reduce both time and monetary costs associated with external API calls by leveraging their own computational resources.
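The image-selection feedback described in the abstract can be illustrated with a small sketch. In 3Dify the LLM itself infers variable patterns from the images the user prefers; the code below swaps in a plain per-variable average as a stand-in for that step, so it shows the data flow only. The function names and the example variables (`height`, `density`) are hypothetical and do not come from the paper.

```python
from statistics import mean


def summarize_selections(selected: list[dict[str, float]]) -> dict[str, float]:
    """Average each variable over the candidates the user preferred.

    Stand-in for the LLM's pattern learning: only variables present in
    every selected candidate are kept.
    """
    if not selected:
        return {}
    shared = set.intersection(*(set(params) for params in selected))
    return {name: mean(params[name] for params in selected) for name in sorted(shared)}


def build_feedback_prompt(summary: dict[str, float]) -> str:
    """Render the summary as text that could accompany the next generation request."""
    if not summary:
        return "No preference data yet."
    lines = [f"- {name}: around {value:.2f}" for name, value in summary.items()]
    return "The user preferred candidates with these variable values:\n" + "\n".join(lines)
```

A caller would record the generation variables of each candidate image, pass the ones the user picked to `summarize_selections`, and append the text from `build_feedback_prompt` to the next request.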

Paper Structure

This paper contains 16 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Position of 3Dify in related software.
  • Figure 2: System architecture diagram of the 3Dify framework.
  • Figure 3: Detailed view of 3Dify's LLM agents.
  • Figure 4: Image-selection feedback loop.
  • Figure 5: Example of Chatflow in 3Dify.
  • ...and 4 more figures