MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion

Sen Li, Ruochen Wang, Cho-Jui Hsieh, Minhao Cheng, Tianyi Zhou

TL;DR

MuLan addresses the difficulty of generating images that contain multiple objects with precise spatial relations. It is a training-free, progressive diffusion framework guided by a Multimodal-LLM agent. An LLM decomposes a complex prompt into object-centric sub-prompts; each object is then generated in sequence by attention-guided diffusion conditioned on the objects already generated; and a VLM feedback loop checks each step against the prompt and triggers re-generation when the prompt is violated. The approach also includes a strategy for handling overlapping objects and supports human-in-the-loop edits during generation. On a curated set of multi-object prompts, MuLan outperforms baselines in object completeness, attribute binding, and spatial accuracy, and ablations indicate that VLM feedback and mid-level attention blocks are critical to these results.

Abstract

Existing text-to-image models still struggle to generate images of multiple objects, especially in handling their spatial positions, relative sizes, overlapping, and attribute bindings. To efficiently address these challenges, we develop a training-free Multimodal-LLM agent (MuLan) that, like a human painter, progressively generates multiple objects with intricate planning and feedback control. MuLan harnesses a large language model (LLM) to decompose a prompt into a sequence of sub-tasks, each generating only one object with Stable Diffusion, conditioned on previously generated objects. Unlike existing LLM-grounded methods, MuLan produces only a high-level plan at the beginning, while the exact size and location of each object are determined within each sub-task by an LLM and attention guidance. Moreover, MuLan adopts a vision-language model (VLM) to provide feedback on the image generated in each sub-task and to make the diffusion model re-generate the image if it violates the original prompt. Hence, each model in every step of MuLan only needs to address an easy sub-task it is specialized for. The multi-step process also allows human users to monitor the generation process and make preferred changes at any intermediate step via text prompts, thereby improving the human-AI collaboration experience. We collect 200 prompts containing multiple objects with spatial relationships and attribute bindings from different benchmarks to evaluate MuLan. The results demonstrate the superiority of MuLan over baselines in generating multiple objects, as well as its creativity when collaborating with human users. The code is available at https://github.com/measure-infinity/mulan-code.
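
The plan, generate, and check loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Canvas`, `progressive_generate`, the `toy_*` helpers, and the `max_retries` budget are all hypothetical names and assumptions introduced here. The planner, the single-object generator, and the checker are passed in as plain callables that stand in for the LLM, the attention-guided diffusion step, and the VLM respectively, and the attempt index stands in for whatever generation settings are adjusted on a retry.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Canvas:
    """Toy stand-in for an image: the objects drawn so far, in order."""
    objects: Tuple[str, ...] = ()


def progressive_generate(
    prompt: str,
    plan: Callable[[str], List[str]],                    # stand-in for LLM planning
    generate_one: Callable[[Canvas, str, int], Canvas],  # stand-in for one-object diffusion
    check: Callable[[Canvas, str], bool],                # stand-in for VLM feedback
    max_retries: int = 3,  # assumed retry budget, not taken from the paper
) -> Tuple[Canvas, List[int]]:
    """Generate one object per sub-prompt, retrying a step when the check fails."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    canvas = Canvas()
    attempts_used: List[int] = []
    for sub_prompt in plan(prompt):
        candidate = canvas
        for attempt in range(max_retries):
            # Each attempt is conditioned on the canvas accepted so far.
            candidate = generate_one(canvas, sub_prompt, attempt)
            if check(candidate, sub_prompt):
                break
        attempts_used.append(attempt + 1)
        # Keep the last candidate even if every check failed.
        canvas = candidate
    return canvas, attempts_used


def toy_plan(prompt: str) -> List[str]:
    # Hypothetical planner: split on " and "; a real system would query an LLM.
    return [part.strip() for part in prompt.split(" and ")]


def toy_generate(canvas: Canvas, sub_prompt: str, attempt: int) -> Canvas:
    # Deliberately botch the first attempt at any "blue" object to exercise the retry.
    botched = "blue" in sub_prompt and attempt == 0
    drawn = "a grey blob" if botched else sub_prompt
    return Canvas(canvas.objects + (drawn,))


def toy_check(candidate: Canvas, sub_prompt: str) -> bool:
    # Hypothetical checker: the newest object must match the sub-prompt exactly.
    return candidate.objects[-1] == sub_prompt


final, attempts = progressive_generate(
    "a red ball and a blue cup", toy_plan, toy_generate, toy_check
)
print(final.objects)  # ('a red ball', 'a blue cup')
print(attempts)       # [1, 2]
```

In this toy run the second object needs two attempts because the first candidate fails the check. Keeping the three roles as separate callables mirrors the abstract's point that each model only has to solve the easy sub-task it is specialized for.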

Paper Structure

This paper contains 35 sections, 7 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: The proposed training-free Multimodal-LLM Agent (MuLan) for Progressive Multi-Object Diffusion. MuLan consists of three main components: (1) LLM planning; (2) single-object diffusion with attention guidance; and (3) VLM-feedback control. MuLan first decomposes a complicated prompt into a sequence of sub-prompts, one per object, and then generates one object per step conditioned on a sub-prompt and the previously generated objects. In each step, the LLM plans the rough layout of the object and attention guidance provides an accurate mask for it. The VLM-feedback control allows MuLan to correct mistakes in each step by adjusting the hyperparameters of component (2).
  • Figure 2: Examples of MuLan-generated images, compared to the original SD-v1.4 [rombach2022high], the original SDXL [podell2023sdxl], Structure Diffusion [feng2022training], Promptist [hao2022optimizing], and PixArt-$\alpha$ [chen2023pixart].
  • Figure 3: Single-object diffusion with LLM planning and attention guidance for $\texttt{obj}_n$ (the detailed procedure is given as an algorithm in the appendix).
  • Figure 4: A tree illustrating different cases of human-agent interaction during generation. The middle branch (connected by blue arrows) shows the original generation process without human-agent interaction. The top and bottom branches show different compositions of human-agent interactions during generation, covering object adjustments, attribute adjustments, and spatial relationship adjustments. Together they demonstrate the flexibility and effectiveness of MuLan for human-agent interaction during generation.
  • Figure 5: More qualitative examples of images generated by different methods on intricate prompts.
  • ...and 5 more figures