MUSES: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Yanbo Ding, Shaobin Zhuang, Kunchang Li, Zhengrong Yue, Yu Qiao, Yali Wang

TL;DR

The paper tackles 3D-controllable image generation from natural language, where existing text-to-image methods struggle with multi-object scenes that require accurate 3D relations. It proposes MUSES, a modular three-agent system: a Layout Manager plans a 2D layout and lifts it to 3D, a Model Engineer retrieves and calibrates 3D object models, and an Image Artist renders the 3D scene into 2D conditions for image generation. It also introduces T2I-3DisBench, a benchmark of 50 prompts describing complex 3D scenes, which supports evaluation of object count, orientation, 3D spatial relationships, and camera view. MUSES reports state-of-the-art results on T2I-CompBench and T2I-3DisBench, outperforming Stable Diffusion 3 and DALL-E 3. By combining explicit 3D planning with multi-modal agent collaboration, the work connects natural language, 2D image generation, and the 3D world, and the authors release their code.

Abstract

Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To tackle this limitation, we introduce a generic AI system, namely MUSES, for 3D-controllable image generation from user queries. Specifically, MUSES addresses this challenging task with a progressive workflow of three key components: (1) Layout Manager for 2D-to-3D layout lifting, (2) Model Engineer for 3D object acquisition and calibration, and (3) Image Artist for 3D-to-2D image rendering. By mimicking the collaboration of human professionals, this multi-modal agent pipeline enables the effective and automatic creation of images with 3D-controllable objects, through an explainable integration of top-down planning and bottom-up generation. Additionally, we find that existing benchmarks lack detailed descriptions of complex 3D spatial relationships among multiple objects. To fill this gap, we further construct a new benchmark, T2I-3DisBench (3D image scene), which describes diverse 3D image scenes with 50 detailed prompts. Extensive experiments show the state-of-the-art performance of MUSES on both T2I-CompBench and T2I-3DisBench, outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3. These results show that MUSES takes a significant step forward in bridging natural language, 2D image generation, and the 3D world. Our code is available at the following link: https://github.com/DINGYANB/MUSES.
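The three-component workflow in the abstract can be pictured as a simple chain of stages. The sketch below is only an illustration of that data flow, not the authors' implementation: `ObjectSpec`, its fields, and `run_pipeline` are hypothetical names, and each stage is passed in as a plain callable so that no specific LLM, 3D tool, or image generator is assumed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class ObjectSpec:
    """One planned object. All field names are illustrative assumptions."""
    name: str
    position: Tuple[float, float, float]  # (x, y, depth) in an assumed scene frame
    yaw_degrees: float                    # assumed orientation relative to the camera


def run_pipeline(
    prompt: str,
    layout_manager: Callable[[str], List[ObjectSpec]],
    model_engineer: Callable[[ObjectSpec], Any],
    image_artist: Callable[[str, List[ObjectSpec], Dict[str, Any]], Any],
) -> Any:
    """Chain the three stages: plan a 3D layout, acquire models, render."""
    # Stage 1: plan the layout (2D planning and 3D lifting happen inside).
    layout = layout_manager(prompt)
    # Stage 2: acquire and calibrate one 3D model per planned object.
    models = {obj.name: model_engineer(obj) for obj in layout}
    # Stage 3: turn the layout and models into the final image.
    return image_artist(prompt, layout, models)


if __name__ == "__main__":
    # Stub stages, used only to show the data flow end to end.
    def stub_layout(prompt: str) -> List[ObjectSpec]:
        return [ObjectSpec("chair", (0.0, 0.0, 2.0), 90.0)]

    def stub_engineer(obj: ObjectSpec) -> str:
        return f"mesh:{obj.name}"

    def stub_artist(prompt: str, layout: List[ObjectSpec], models: Dict[str, Any]) -> str:
        return f"image of {len(layout)} object(s) for '{prompt}'"

    print(run_pipeline("a chair facing left", stub_layout, stub_engineer, stub_artist))
```

Keeping the stages as interchangeable callables mirrors the modular framing of the abstract: each agent can be replaced or tested in isolation.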

Paper Structure (21 sections, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Comparison Results With Various Methods. MUSES achieves the best results, with object numbers highlighted in brown, object orientations in yellow, 3D spatial relationships in blue, and camera views in green, outperforming both open-source state-of-the-art methods and commercial API products, such as Stable Diffusion 3, DALL-E 3, and Midjourney v6.0.
  • Figure 2: Overview of our MUSES. Based on the user input, Layout Manager first plans a 2D layout and lifts it to a 3D one. Then, Model Engineer acquires 3D models of query objects and aligns them to face the camera. Finally, Image Artist assembles all the 3D object models into visual conditions that are used for final controllable image generation.
  • Figure 3: 2D-to-3D Layout Manager. First, based on the user input, our layout manager employs the LLM to plan a 2D layout through in-context learning. Then, it lifts the 2D layout to 3D space via chain-of-thought reasoning.
  • Figure 4: 3D Model Retriever. We develop a retrieve-generate decision tree that automatically acquires 3D objects specified in the 3D layout from a self-collected model shop, based on a concise decision process of online search and text-to-3D generation.
  • Figure 5: 3D Model Aligner. It aligns each 3D model to a face-camera orientation, ensuring that the final orientation conforms to the planned 3D layout. First, we fine-tune CLIP as a Face-Camera Classifier, using a training set generated from our 3D model shop. Then, we use the fine-tuned CLIP to identify the face-camera image of each object and align its 3D model to face the camera.
  • ...and 8 more figures
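
The retrieve-generate decision tree in the Figure 4 caption can be read as a fallback chain. The sketch below is one plausible reading, not the paper's code: the order (local model shop, then online search, then text-to-3D generation), the caching of new models back into the shop, and every name (`acquire_model`, `search_online`, `generate_from_text`) are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional


def acquire_model(
    name: str,
    shop: Dict[str, str],
    search_online: Callable[[str], Optional[str]],
    generate_from_text: Callable[[str], str],
) -> str:
    """Return a 3D model path for `name` using an assumed fallback order."""
    # 1. Reuse a model already stored in the local shop.
    if name in shop:
        return shop[name]
    # 2. Otherwise try an online search (hypothetical callable; None means no hit).
    found = search_online(name)
    # 3. Fall back to text-to-3D generation when the search finds nothing.
    model = found if found is not None else generate_from_text(name)
    # Assumption: newly acquired models are cached in the shop for reuse.
    shop[name] = model
    return model


if __name__ == "__main__":
    shop = {"car": "car.glb"}
    no_hit = lambda name: None
    generate = lambda name: f"{name}_generated.glb"
    print(acquire_model("car", shop, no_hit, generate))  # taken from the shop
    print(acquire_model("owl", shop, no_hit, generate))  # generated, then cached
```

Passing the search and generation steps as callables keeps the sketch independent of any particular 3D asset site or text-to-3D model.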