CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design

Hui Zhang, Dexiang Hong, Maoke Yang, Yutao Cheng, Zhao Zhang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang

TL;DR

CreatiDesign addresses the problem of generating graphic designs from multiple heterogeneous conditions by formalizing graphic design as $I_g=f(P,I_s,L)$, where $P$ is a global prompt, $I_s$ the multi-subject image condition, and $L$ the semantic layout. It proposes a unified multi-condition diffusion-transformer architecture with native encoders and multimodal attention, plus a multimodal attention mask system to prevent leakage and enable precise control. It also introduces a fully automated dataset pipeline that yields 400K annotated designs and a comprehensive benchmark. Experimental results show state-of-the-art performance across multi-subject preservation, semantic layout alignment, and overall image quality, with strong qualitative and user-study support, enabling scalable, intent-driven graphic design and robust editing capabilities.
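To make the formulation $I_g=f(P,I_s,L)$ concrete, the three inputs can be pictured as a small data structure. The following is only an illustrative sketch: the class names, field names, and the normalized-box convention are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical convention: (x0, y0, x1, y1), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class SubjectAsset:
    """One entry of the multi-subject image condition I_s (assumed shape)."""
    image_path: str
    box: Box


@dataclass
class LayoutEntry:
    """One entry of the semantic layout L: a description plus a position."""
    description: str
    box: Box


@dataclass
class DesignRequest:
    """The inputs (P, I_s, L) of f; the output I_g is the generated design."""
    prompt: str                                                   # P
    subjects: List[SubjectAsset] = field(default_factory=list)   # I_s
    layout: List[LayoutEntry] = field(default_factory=list)      # L


def validate(req: DesignRequest) -> List[str]:
    """Return a list of human-readable problems; empty if the request is usable."""
    problems = []
    if not req.prompt.strip():
        problems.append("global prompt is empty")
    for item in list(req.subjects) + list(req.layout):
        x0, y0, x1, y1 = item.box
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            problems.append(f"invalid box {item.box}")
    return problems
```

A request with one subject image and one text region would then be `DesignRequest(prompt, [SubjectAsset(...)], [LayoutEntry(...)])`, checked with `validate` before being handed to a model.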

Abstract

Graphic design plays a vital role in visual communication across advertising, marketing, and multimedia entertainment. Prior work has explored automated graphic design generation using diffusion models, aiming to streamline creative workflows and democratize design capabilities. However, complex graphic design scenarios require accurately adhering to design intent specified by multiple heterogeneous user-provided elements (e.g., images, layouts, and texts), which poses multi-condition control challenges for existing methods. Specifically, previous single-condition control models are effective only within their specialized domains and fail to generalize to other conditions, while existing multi-condition methods often lack fine-grained control over each sub-condition and compromise overall compositional harmony. To address these limitations, we introduce CreatiDesign, a systematic solution for automated graphic design covering both model architecture and dataset construction. First, we design a unified multi-condition-driven architecture that enables flexible and precise integration of heterogeneous design elements with minimal architectural modifications to the base diffusion model. Furthermore, to ensure that each condition precisely controls its designated image region and to avoid interference between conditions, we propose a multimodal attention mask mechanism. Additionally, we develop a fully automated pipeline for constructing graphic design datasets, and introduce a new dataset with 400K samples featuring multi-condition annotations, along with a comprehensive benchmark. Experimental results show that CreatiDesign outperforms existing models by a clear margin in faithfully adhering to user intent.
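The idea behind the multimodal attention mask, that each condition controls only its designated image region and conditions do not interfere with one another, can be sketched as a boolean mask over a joint token sequence. This is a minimal sketch under stated assumptions, not the paper's implementation: the token ordering (prompt, image patches, then one block per condition), the function names, and the specific rules (condition tokens attend only to themselves; only image patches inside a condition's box attend to it) are all assumptions made for illustration.

```python
import numpy as np


def box_to_token_mask(box, grid_h, grid_w):
    """True for image patches whose center lies inside box = (x0, y0, x1, y1) in [0, 1]."""
    x0, y0, x1, y1 = box
    ys = (np.arange(grid_h) + 0.5) / grid_h   # patch-center rows
    xs = (np.arange(grid_w) + 0.5) / grid_w   # patch-center columns
    inside = ((ys[:, None] >= y0) & (ys[:, None] < y1)
              & (xs[None, :] >= x0) & (xs[None, :] < x1))
    return inside.reshape(-1)                 # row-major patch order


def build_attention_mask(n_text, grid_h, grid_w, cond_boxes, cond_lens):
    """mask[q, k] is True when query token q may attend to key token k.

    Assumed sequence layout: [prompt | image patches | cond_0 | cond_1 | ...].
    """
    n_img = grid_h * grid_w
    core = n_text + n_img
    n_total = core + sum(cond_lens)
    mask = np.zeros((n_total, n_total), dtype=bool)

    # Prompt and image tokens attend to each other freely.
    mask[:core, :core] = True

    start = core
    for box, length in zip(cond_boxes, cond_lens):
        end = start + length
        # Assumption: a condition's tokens attend only to themselves,
        # so nothing leaks between conditions.
        mask[start:end, start:end] = True
        # Assumption: only image patches inside the box attend to this condition.
        region = np.flatnonzero(box_to_token_mask(box, grid_h, grid_w)) + n_text
        mask[region, start:end] = True
        start = end
    return mask
```

For example, `build_attention_mask(2, 4, 4, [(0, 0, 0.5, 0.5), (0.5, 0.5, 1, 1)], [3, 2])` yields a 23×23 mask in which the top-left patches see only the first condition's tokens and the bottom-right patches see only the second. Such a mask could be passed to any attention routine that accepts a boolean mask; how the actual model consumes it is not specified here.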

Paper Structure

This paper contains 35 sections, 2 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: CreatiDesign generates high-quality graphic designs based on user-provided image assets and semantic layouts, covering a wide range of categories such as movie posters, brand promotions, product advertisements, and social media content.
  • Figure 2: An overview of our motivation. Graphic design is a multi-condition-driven generation task that requires the precise and harmonious arrangement of heterogeneous elements, including primary visual elements (provided as images with positions), as well as secondary visual and textual elements (both specified by semantic descriptions and positions). Previous methods either support only a single type of condition (e.g., image-driven or layout-driven models) or lack accurate control over each sub-condition (e.g., multi-condition-driven models), and therefore fail to strictly adhere to user design intent, as highlighted by the red and purple masks.
  • Figure 2: Ablation study: quantitative analysis of key components in CreatiDesign.
  • Figure 3: An overview of the architecture. CreatiDesign integrates image and semantic layout conditions through native multimodal attention. Multimodal attention mask ensures that each condition precisely controls its designated image regions while preventing leakage between conditions.
  • Figure 4: Automated pipeline for graphic design dataset construction.
  • ...and 4 more figures