CustomText: Customized Textual Image Generation using Diffusion Models

Shubham Paliwal, Arushi Jain, Monika Sharma, Vikram Jamwal, Lovekesh Vig

TL;DR

CustomText tackles the challenge of rendering accurately styled text in diffusion-based image synthesis by introducing a two-stage pipeline that explicitly controls font attributes via a character mask and a conditional attribute mask. It combines a Layout Transformer-driven first stage with a TextDiffuser-based second stage, and improves small-text fidelity using a VAE-based decoder enhancement path and a ControlNet-augmented Consistency Decoder guided by character maps. Empirical results on CTW-1500 and a dedicated SmallFontSize dataset show improved reconstruction metrics and OCR readability, along with controllable font-attribute rendering, compared to prior textual-image methods. The approach enables practical applications in advertising and design workflows and suggests future work on multilingual support and larger-scale training.

Abstract

Textual image generation spans diverse fields like advertising, education, product packaging, social media, information visualization, and branding. Despite recent strides in language-guided image synthesis using diffusion models, current models excel in image generation but struggle with accurate text rendering and offer limited control over font attributes. In this paper, we aim to enhance the synthesis of high-quality images with precise text customization, thereby contributing to the advancement of image generation models. We call our proposed method CustomText. Our implementation leverages a pre-trained TextDiffuser model to enable control over font color, background, and types. Additionally, to address the challenge of accurately rendering small-sized fonts, we train the ControlNet model for a consistency decoder, significantly enhancing text-generation performance. We assess the performance of CustomText in comparison to previous methods of textual image generation on the publicly available CTW-1500 dataset and a self-curated dataset for small-text generation, showcasing superior results.

Paper Structure (13 sections, 3 equations, 9 figures, 3 tables)

This paper contains 13 sections, 3 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: An example illustrating the use of our proposed CustomText method. It simulates an advertisement design workflow for the ad campaign "Adopt a Pet", where the initial image (A) is generated from the text prompt P by the Stable Diffusion base model SDXL. Subsequently, the user can customize the font attributes of the text through a user interface (B). The user can also perform incremental editing, i.e., append or remove text lines, by using the space character (" ") on top of visible text (C). The process can repeat until the end user is satisfied with the final generated result (D).
  • Figure 2: Example demonstrating the control of our proposed method, CustomText, over font color, font type, and font background on the base image.
  • Figure 3: Stage 1 pipeline for generating the character mask ($M_{char}$) and the conditional mask ($M_{cond}$) from the input textual prompt and the control parameters that define font attributes. The transformer encoder-decoder architecture takes the input prompt and extracts a unique, non-overlapping bounding box for each word. The input control parameters define font attributes such as color, type, and background, which enable the renderer to generate the desired conditional mask.
  • Figure 4: Stage 2 pipeline of our proposed CustomText method for generating images with the desired text attributes by using character mask ($M_{char}$), conditional mask ($M_{cond}$) and textual prompt (P). Please note that the white region in mask $m$ represents the region where the user wants to perform generation.
  • Figure 5: An example use case of textual-image generation, such as an advertisement with the text "Adapt a Shelter Pet today". Sub-figures (a), (b), and (c) are generated using the TextDiffuser decoder, the DALLE-3 Consistency decoder, and our proposed CustomText decoder, respectively. CustomText shows superior control over accurate text generation, as is evident from its correct rendering of "Shelter" in comparison to the other methods.
  • ...and 4 more figures
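
The Stage 1 idea in the Figure 3 caption (one non-overlapping box per word, then a renderer that turns font attributes into masks) can be illustrated with a toy sketch. This is not the authors' implementation: `WordBox`, `boxes_overlap`, and `render_masks` are hypothetical names, the boxes are supplied by hand instead of being predicted by a layout transformer, and glyph rendering is replaced by a crude placeholder in which each character fills an equal-width cell whose interior takes the font color and whose border takes the background color. Font type is omitted because it would need a real font rasterizer.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Color = Tuple[int, int, int]


@dataclass(frozen=True)
class WordBox:
    """Hypothetical layout entry: a word, its box, and two font attributes."""
    word: str
    x0: int
    y0: int
    x1: int  # exclusive
    y1: int  # exclusive
    font_color: Color
    background: Color


def boxes_overlap(a: WordBox, b: WordBox) -> bool:
    """True if two axis-aligned boxes share at least one pixel."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def render_masks(
    boxes: List[WordBox], width: int, height: int
) -> Tuple[List[List[str]], List[List[Optional[Color]]]]:
    """Return toy (m_char, m_cond) grids indexed as grid[y][x].

    m_char holds the character assigned to each pixel ("" where there is none).
    m_cond holds an RGB color per pixel (None outside every box).
    """
    # Enforce the non-overlap property mentioned in the caption.
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            if boxes_overlap(a, b):
                raise ValueError(f"boxes for {a.word!r} and {b.word!r} overlap")

    m_char = [["" for _ in range(width)] for _ in range(height)]
    m_cond: List[List[Optional[Color]]] = [
        [None for _ in range(width)] for _ in range(height)
    ]

    for box in boxes:
        if not (0 <= box.x0 < box.x1 <= width and 0 <= box.y0 < box.y1 <= height):
            raise ValueError(f"box for {box.word!r} lies outside the canvas")
        if not box.word or (box.x1 - box.x0) < len(box.word):
            raise ValueError(f"box for {box.word!r} is too narrow")
        # Equal-width character cells; leftover columns stay unlabeled.
        cell_w = (box.x1 - box.x0) // len(box.word)
        for i, ch in enumerate(box.word):
            cx0 = box.x0 + i * cell_w
            cx1 = cx0 + cell_w
            for y in range(box.y0, box.y1):
                for x in range(cx0, cx1):
                    m_char[y][x] = ch
                    # Placeholder "glyph": cell interior gets the font color.
                    inner = cx0 < x < cx1 - 1 and box.y0 < y < box.y1 - 1
                    m_cond[y][x] = box.font_color if inner else box.background
    return m_char, m_cond


# Example layout with two hand-placed words on a 16x10 canvas.
example_boxes = [
    WordBox("Adopt", 0, 0, 15, 4, (255, 0, 0), (255, 255, 255)),
    WordBox("Pet", 0, 5, 9, 9, (0, 0, 255), (255, 255, 0)),
]
example_m_char, example_m_cond = render_masks(example_boxes, 16, 10)
```

In this sketch the two grids play the roles the captions assign to $M_{char}$ and $M_{cond}$: one says which character belongs at each location, the other carries the requested color attributes. How the actual renderer encodes font type and background is not specified in this section.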