CreatiDesign: A Unified Multi-Conditional Diffusion Transformer for Creative Graphic Design
Hui Zhang, Dexiang Hong, Maoke Yang, Yutao Cheng, Zhao Zhang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang
TL;DR
CreatiDesign addresses the problem of generating graphic designs from multiple heterogeneous conditions by formalizing graphic design as $I_g=f(P,I_s,L)$, where $I_g$ is the generated design, $P$ is a global prompt, $I_s$ is the multi-subject image condition, and $L$ is the semantic layout. It proposes a unified multi-condition diffusion-transformer architecture with native encoders and multimodal attention, plus a multimodal attention mask mechanism that prevents leakage between conditions and enables precise control. It also introduces a fully automated dataset pipeline that yields 400K annotated designs, along with a comprehensive benchmark. Experimental results show state-of-the-art performance on multi-subject preservation, semantic layout alignment, and overall image quality, supported by qualitative results and a user study, enabling scalable, intent-driven graphic design and robust editing.
Abstract
Graphic design plays a vital role in visual communication across advertising, marketing, and multimedia entertainment. Prior work has explored automated graphic design generation using diffusion models, aiming to streamline creative workflows and democratize design capabilities. However, complex graphic design scenarios require accurately adhering to design intent specified by multiple heterogeneous user-provided elements (e.g., images, layouts, and text), which poses multi-condition control challenges for existing methods. Specifically, previous single-condition control models are effective only within their specialized domains and fail to generalize to other conditions, while existing multi-condition methods often lack fine-grained control over each sub-condition and compromise overall compositional harmony. To address these limitations, we introduce CreatiDesign, a systematic solution for automated graphic design covering both model architecture and dataset construction. First, we design a unified multi-condition-driven architecture that enables flexible and precise integration of heterogeneous design elements with minimal architectural modifications to the base diffusion model. Furthermore, to ensure that each condition precisely controls its designated image region and to avoid interference between conditions, we propose a multimodal attention mask mechanism. Additionally, we develop a fully automated pipeline for constructing graphic design datasets, and introduce a new dataset with 400K samples featuring multi-condition annotations, along with a comprehensive benchmark. Experimental results show that CreatiDesign outperforms existing models by a clear margin in faithfully adhering to user intent.
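To make the idea of an attention mask that keeps conditions from interfering more concrete, the following is a minimal sketch and not the paper's implementation. The token groups (`prompt`, `image`, `subject`, `layout`), their sizes, and the rule for which groups may attend to which are all illustrative assumptions; the abstract does not specify the actual mask layout.

```python
import math

def build_block_mask(sizes, allowed_pairs):
    """Return (mask, spans); mask[i][j] is True when query token i may attend to key token j."""
    spans, start = {}, 0
    for name, n in sizes.items():
        spans[name] = (start, start + n)
        start += n
    mask = [[False] * start for _ in range(start)]
    for q_name, k_name in allowed_pairs:
        q0, q1 = spans[q_name]
        k0, k1 = spans[k_name]
        for i in range(q0, q1):
            for j in range(k0, k1):
                mask[i][j] = True
    return mask, spans

def masked_softmax(scores, mask_row):
    """Softmax over one row of attention scores; masked keys receive zero weight."""
    kept = [s for s, m in zip(scores, mask_row) if m]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical token counts per group (assumption, for illustration only).
sizes = {"prompt": 2, "image": 4, "subject": 3, "layout": 3}

# Assumed rule: every group attends to itself, and the image tokens exchange
# attention with every condition. Conditions never attend to each other, so
# subject tokens cannot leak into layout tokens or vice versa.
allowed = [(g, g) for g in sizes]
allowed += [("image", g) for g in sizes if g != "image"]
allowed += [(g, "image") for g in sizes if g != "image"]

mask, spans = build_block_mask(sizes, allowed)

# Attention weights for the first subject token, using dummy scores.
s0, s1 = spans["subject"]
l0, l1 = spans["layout"]
row = masked_softmax([0.1 * j for j in range(len(mask))], mask[s0])
```

Under these assumptions, `row` places zero weight on every layout and prompt token, so the subject token's output depends only on subject and image tokens. A real diffusion transformer would apply the same kind of boolean mask inside its attention layers, typically as an additive bias of negative infinity on the disallowed positions.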
