GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

Zhenyu Wang, Aoxue Li, Zhenguo Li, Xihui Liu

TL;DR

GenArtist targets reliable image generation and editing from complex prompts by using a multimodal LLM agent as a coordinating 'brain'. The agent decomposes a task into sub-problems, plans actions with a planning tree that is verified step by step, and invokes a broad tool library (including position-aware auxiliary tools) to perform generation and editing, followed by autonomous self-correction. On T2I-CompBench and MagicBrush, the reported results are state of the art, with notable gains in attribute binding, spatial relationships, and multi-turn editing reliability. Position-aware execution and verification improve controllability and robustness, offering a practical path toward unified, autonomous image synthesis.

Abstract

Despite the success achieved by existing image generation and editing methods, current models still struggle with complex problems including intricate text prompts, and the absence of verification and self-correction mechanisms makes the generated images unreliable. Meanwhile, a single model tends to specialize in particular tasks and possess the corresponding capabilities, making it inadequate for fulfilling all user requirements. We propose GenArtist, a unified image generation and editing system, coordinated by a multimodal large language model (MLLM) agent. We integrate a comprehensive range of existing models into the tool library and utilize the agent for tool selection and execution. For a complex problem, the MLLM agent decomposes it into simpler sub-problems and constructs a tree structure to systematically plan the procedure of generation, editing, and self-correction with step-by-step verification. By automatically generating missing position-related inputs and incorporating position information, the appropriate tool can be effectively employed to address each sub-problem. Experiments demonstrate that GenArtist can perform various generation and editing tasks, achieving state-of-the-art performance and surpassing existing models such as SDXL and DALL-E 3, as can be seen in Fig. 1. Project page is https://zhenyuw16.github.io/GenArtist_page.
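The abstract's point about automatically generating missing position-related inputs can be illustrated with a minimal sketch. This is not the authors' implementation: the `Tool` record, the `locate` helper, the box format, and the two example tools are all hypothetical names introduced here, and images are replaced by strings so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical pixel box format (x0, y0, x1, y1); the paper does not fix one here.
Box = Tuple[int, int, int, int]


@dataclass
class Tool:
    name: str
    needs_position: bool  # True if the tool cannot run without a box
    run: Callable[[str, Optional[Box]], str]


def locate(target: str) -> Box:
    """Hypothetical auxiliary tool standing in for a detector or layout generator."""
    return (0, 0, 64, 64)


def invoke(tool: Tool, instruction: str, box: Optional[Box] = None) -> str:
    # If the chosen tool needs a position and none was supplied,
    # produce it with the auxiliary tool before calling the main tool.
    if tool.needs_position and box is None:
        box = locate(instruction)
    return tool.run(instruction, box)


# Two toy tools: one works from text alone, one requires a box.
text_editor = Tool("instruction_edit", False, lambda text, box: f"edit[{text}]")
box_editor = Tool("box_edit", True, lambda text, box: f"edit[{text}]@{box}")

print(invoke(text_editor, "make it sunny"))
print(invoke(box_editor, "add a cup"))
```

The design choice shown is that the caller never has to know which tools are position-aware: the dispatcher fills the gap, so the agent can pick a tool by capability alone.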

Paper Structure (14 sections, 11 figures, 6 tables)

Figures (11)

  • Figure 1: Visualized examples from GenArtist. It can accomplish various tasks, achieving unified generation and editing. For text-to-image generation, it obtains greater accuracy compared to existing models like SDXL and DALL-E 3. For image editing, it also excels in complex editing tasks.
  • Figure 2: The overview of our GenArtist. The MLLM agent is responsible for decomposing problems and planning using a tree structure, then invoking tools to address the issues. Employing the agent as the "brain" effectively realizes a unified generation and editing system.
  • Figure 3: Illustration of the tree for planning. The sub-tree of the "alternative generation tool" node will be adaptively generated after verification, and the sub-tree of the "instruction" node is the same as the left.
  • Figure 4: Visualization of the planning tree for image generation tasks.
  • Figure 5: Visualization of the planning tree for image editing tasks.
  • ...and 6 more figures
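
The planning tree described in the captions for Figures 3 to 5 can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's algorithm: an "image" is modelled as the set of prompt requirements it satisfies, and `Node`, `run_tree`, `toy_apply`, `toy_verify`, and `toy_expand` are hypothetical names. What it tries to show is that a node's sub-tree is generated only after verification fails, with editing steps tried before an alternative generation tool.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List

# Toy stand-in for an image: the set of requirements it currently satisfies.
State = FrozenSet[str]
REQUIRED: State = frozenset({"red apple", "blue cup"})


@dataclass
class Node:
    kind: str  # "generate" or "edit"
    arg: str   # tool name for "generate", instruction for "edit"
    children: List["Node"] = field(default_factory=list)


def run_tree(root, apply, verify, expand, max_steps=8):
    """Depth-first walk: run a node, verify, grow its sub-tree only on failure."""
    stack = [(root, frozenset())]
    trace = []
    while stack and len(trace) < max_steps:
        node, before = stack.pop()
        after = apply(node, before)
        trace.append(f"{node.kind}:{node.arg}")
        missing = verify(after)
        if not missing:
            return after, trace  # verification passed, stop here
        node.children = expand(node, missing)
        for child in reversed(node.children):  # first child is tried first
            stack.append((child, after))
    return None, trace


def toy_apply(node: Node, state: State) -> State:
    if node.kind == "generate":
        return frozenset({"red apple"})  # pretend the generator misses the cup
    return state | {node.arg}            # pretend an edit fixes exactly one item


def toy_verify(state: State) -> List[str]:
    return sorted(REQUIRED - state)      # requirements still unmet


def toy_expand(node: Node, missing: List[str]) -> List[Node]:
    edits = [Node("edit", m) for m in missing]
    # In this toy, only the default generator gets an alternative-tool branch.
    alt = [Node("generate", "alternative tool")] if node.arg == "default tool" else []
    return edits + alt


result, trace = run_tree(Node("generate", "default tool"),
                         toy_apply, toy_verify, toy_expand)
print(trace)
```

In this run the default generator misses one requirement, the verifier reports it, and a single corrective edit satisfies the prompt, so the alternative-tool branch is created but never executed.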