Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching

Minghao Liu; Le Zhang; Yingjie Tian; Xiaochao Qu; Luoqi Liu; Ting Liu

TL;DR

CxD presents a training-free diffusion framework that tackles complex scene generation by leveraging LLM-driven prompting under Complex Decomposition Criteria (CDC). It splits generation into three artist-inspired stages—composition, painting, and retouching—coupled with cross-attention modulation and ControlNet-tile retouching to maintain semantic coherence while enriching detail. Empirical results on SDXL and T2I-CompBench demonstrate state-of-the-art performance for complex prompts, especially in maintaining object relationships and spatial layouts. The approach offers a scalable, modular workflow that can be extended with additional modalities to further enhance controllability.

Abstract

Recent advances in text-to-image diffusion models have demonstrated impressive capabilities in image quality. However, complex scene generation remains relatively unexplored, and even the definition of 'complex scene' itself remains unclear. In this paper, we address this gap by providing a precise definition of complex scenes and introducing a set of Complex Decomposition Criteria (CDC) based on this definition. Inspired by the artist's painting process, we propose a training-free diffusion framework called Complex Diffusion (CxD), which divides the process into three stages: composition, painting, and retouching. Our method leverages the powerful chain-of-thought capabilities of large language models (LLMs) to decompose complex prompts based on CDC and to manage composition and layout. We then develop an attention modulation method that guides simple prompts to specific regions to complete the complex scene painting. Finally, we inject the detailed output of the LLM into a retouching model to enhance the image details, thus implementing the retouching stage. Extensive experiments demonstrate that our method outperforms previous SOTA approaches, significantly improving the generation of high-quality, semantically consistent, and visually diverse images for complex scenes, even with intricate prompts.
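To make the idea of guiding simple prompts to specific regions more concrete, the sketch below turns a layout of bounding boxes into per-region binary masks on a small latent grid. It is an illustration only, not the paper's attention modulation: the `Region` class, the `region_mask` helper, the normalized `(x0, y0, x1, y1)` box format, the 4x4 grid, and the example sub-prompts are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One decomposed sub-prompt and its layout box (hypothetical structure)."""
    prompt: str   # simple sub-prompt for a single entity (made-up example text)
    box: tuple    # (x0, y0, x1, y1), normalized to [0, 1]; assumed layout format


def region_mask(box, height, width):
    """Return a height x width list-of-lists mask that is 1 inside the box.

    A grid cell belongs to the box when the cell's center lies inside it.
    """
    x0, y0, x1, y1 = box
    mask = []
    for row in range(height):
        cy = (row + 0.5) / height  # normalized vertical center of the cell
        line = []
        for col in range(width):
            cx = (col + 0.5) / width  # normalized horizontal center of the cell
            line.append(1 if (x0 <= cx < x1 and y0 <= cy < y1) else 0)
        mask.append(line)
    return mask


# Hypothetical layout, standing in for what an LLM might return.
regions = [
    Region("a red fox", (0.0, 0.5, 0.5, 1.0)),    # bottom-left quadrant
    Region("a full moon", (0.5, 0.0, 1.0, 0.5)),  # top-right quadrant
]
masks = [region_mask(r.box, 4, 4) for r in regions]
```

In a region-guided pipeline, masks of this kind could be used to restrict where each sub-prompt influences the image; how CxD actually modulates cross-attention is described in the paper's Cross-Attention Modulation section and is not reproduced here.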

Paper Structure (24 sections, 10 equations, 6 figures, 1 table)

This paper contains 24 sections, 10 equations, 6 figures, 1 table.

Introduction
Related Work
Complex Scene Generation
Compositional Diffusion Generation
LLM for Image generation
Complex Scene in Pre-Trained Diffusion
Definition of 'Complex' in Pre-Trained Diffusion
Complex scene Decomposition Criteria (CDC)
Method: CxD
Composition and layout generation with LLMs
Entities extraction
Prompts recaption
Prompts merge or divide
Layout assignment
Cross-Attention Modulation
...and 9 more sections

Figures (6)

Figure 1: Limitations of pre-trained diffusion models in complex scene generation.
Figure 2: Performance trends of the SD XL model with varying numbers of entities and concepts
Figure 3: Overview of CxD framework for complex scene image generation.
Figure 4: The demonstration of each sampling step in CxD.
Figure 5: Qualitative comparison between CxD and SOTA text-to-image models
...and 1 more figures
