DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models

Zhendong Wang, Jianmin Bao, Shuyang Gu, Dong Chen, Wengang Zhou, Houqiang Li

TL;DR

DesignDiffusion tackles the challenge of generating design images with accurate, embedded text by introducing an end-to-end diffusion framework that learns both visuals and visual text from prompts. It augments the CLIP prompt with rendered-character information, supervises character localization through a loss on the cross-attention maps, and uses Self-Play Direct Preference Optimization to refine text rendering without human-annotated preferences. The method achieves state-of-the-art results in image quality (FID) and visual-text accuracy (OCR metrics) on a large, design-focused dataset, outperforming contemporary text-to-image and visual-text baselines. The result is cohesive, design-grade images with integrated typography generated directly from natural language prompts, which supports scalable design generation workflows.

Abstract

In this paper, we present DesignDiffusion, a simple yet effective framework for the novel task of synthesizing design images from textual descriptions. A primary challenge lies in generating accurate and style-consistent textual and visual content. Existing works in a related task of visual text generation often focus on generating text within given specific regions, which limits the creativity of generation models, resulting in style or color inconsistencies between textual and visual elements if applied to design image generation. To address this issue, we propose an end-to-end, one-stage diffusion-based framework that avoids intricate components like position and layout modeling. Specifically, the proposed framework directly synthesizes textual and visual design elements from user prompts. It utilizes a distinctive character embedding derived from the visual text to enhance the input prompt, along with a character localization loss for enhanced supervision during text generation. Furthermore, we employ a self-play Direct Preference Optimization fine-tuning strategy to improve the quality and accuracy of the synthesized visual text. Extensive experiments demonstrate that DesignDiffusion achieves state-of-the-art performance in design image generation.
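
As a rough illustration of the preference fine-tuning mentioned in the abstract, the sketch below computes a DPO-style loss for a single preferred/rejected image pair, in the form commonly used for diffusion models where denoising errors stand in for log-likelihoods. The function name `dpo_style_loss`, its scalar-error interface, and the default `beta` are assumptions made for illustration; the abstract does not state the exact objective or how the preferred and rejected samples are selected.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_style_loss(
    err_win_policy: float,
    err_win_ref: float,
    err_lose_policy: float,
    err_lose_ref: float,
    beta: float = 1.0,
) -> float:
    """DPO-style loss for one (preferred, rejected) pair.

    Each argument is a denoising error (e.g. squared noise-prediction
    error) for the preferred ("win") or rejected ("lose") image, measured
    under the trainable policy model or the frozen reference model.
    This interface is hypothetical, not the paper's implementation.
    """
    # How much worse (positive) or better (negative) the policy is than
    # the reference on each sample.
    win_gap = err_win_policy - err_win_ref
    lose_gap = err_lose_policy - err_lose_ref
    # The policy is rewarded for improving more on the preferred sample
    # than on the rejected one.
    logit = -beta * (win_gap - lose_gap)
    # -log(sigmoid(logit)) == softplus(-logit)
    return _softplus(-logit)
```

When the policy and reference agree on both samples, the loss equals log 2; it drops below that value once the policy denoises the preferred sample better than the reference does.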

Paper Structure

This paper contains 18 sections, 6 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Images generated by our DesignDiffusion model, which only requires a simple text prompt as input and can generate diverse, high-quality design images with accurate textual and vivid visual content.
  • Figure 2: Overview of the DesignDiffusion framework. DesignDiffusion is based on enhanced text prompts, with a trainable CLIP text encoder and UNet, and does not require additional complex conditions (glyphs, positions). A character localization loss is added as extra supervision on the cross-attention maps to force the UNet to attend more to the visual character regions. To further improve the quality of visual text generation, we incorporate a self-play DPO strategy into the fine-tuning process. The diffusion denoising loss is omitted here.
  • Figure 3: Qualitative comparisons with previous state-of-the-art text-to-image generation and visual text rendering methods reveal that our DesignDiffusion produces more elegant and harmonious integrated visual and textual design images.
  • Figure 4: Visual comparisons of the capabilities of automatic text layout planning from our model with those of planning by language models. Upon training, our DesignDiffusion has demonstrated the ability to generate flexible and well-organized text layouts effectively.
  • Figure 5: Examples comparing the OCR capability of different models for detecting text in design images. Red parts denote wrong detections. LLaVA 1.6 achieves the best OCR accuracy for extracting text from design images.
  • ...and 5 more figures
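
The character localization loss described in the Figure 2 caption supervises cross-attention maps so that character tokens attend to the regions where those characters are drawn. A minimal sketch of one plausible form of such a loss follows; the function name `localization_loss`, the nested-list inputs, and the "fraction of attention outside the mask" formulation are assumptions for illustration, since the captions do not give the paper's formula.

```python
from typing import Sequence


def localization_loss(
    attn: Sequence[Sequence[float]],
    mask: Sequence[Sequence[float]],
    eps: float = 1e-8,
) -> float:
    """Fraction of a token's attention mass that falls outside its mask.

    attn: non-negative cross-attention map (H x W) for one character token.
    mask: binary map (H x W), 1 where that character appears in the image.
    Hypothetical formulation, not the paper's exact loss.
    """
    if len(attn) != len(mask) or any(
        len(a_row) != len(m_row) for a_row, m_row in zip(attn, mask)
    ):
        raise ValueError("attn and mask must have the same shape")

    # Total attention mass and the part that lands inside the character region.
    total = sum(sum(row) for row in attn)
    inside = sum(
        a * m
        for a_row, m_row in zip(attn, mask)
        for a, m in zip(a_row, m_row)
    )
    # 0 when all attention is inside the mask, 1 when none of it is.
    return 1.0 - inside / (total + eps)
```

Under this formulation, minimizing the loss pushes attention for a character token into the region that character occupies, which matches the intent stated in the caption.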