UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

Yiming Zhao, Zhouhui Lian

TL;DR

UDiffText targets spelling and glyph errors in text rendered by diffusion models. It replaces the CLIP text encoder with a lightweight character-level text encoder, and trains a cross-attention-guided diffusion model with denoising score matching (DSM) combined with a local attention loss and a scene text recognition loss, so that text is rendered faithfully within arbitrary images. A noised-latent refinement at inference further mitigates catastrophic neglect and improves sequence accuracy. The approach outperforms prior methods in text rendering and scene coherence across reconstruction and editing tasks, and shows potential for accurate T2I generation and other text-centric image synthesis and editing pipelines.

Abstract

Text-to-Image (T2I) generation methods based on diffusion models have garnered significant attention in the last few years. Although these image synthesis methods produce visually appealing results, they frequently exhibit spelling errors when rendering text within the generated images. Such errors manifest as missing, incorrect or extraneous characters, thereby severely constraining the performance of text image generation based on diffusion models. To address this issue, this paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model (i.e., Stable Diffusion [27]). Our approach involves the design and training of a lightweight character-level text encoder, which replaces the original CLIP encoder and provides more robust text embeddings as conditional guidance. Then, we fine-tune the diffusion model using a large-scale dataset, incorporating local attention control under the supervision of character-level segmentation maps. Finally, by employing an inference-stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images. Both qualitative and quantitative results demonstrate the superiority of our method over the state of the art. Furthermore, we showcase several potential applications of the proposed UDiffText, including text-centric image synthesis and scene text editing. Code and model will be available at https://github.com/ZYM-PKU/UDiffText.
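
The fine-tuning setup described above can be illustrated with a small sketch. It is not the authors' implementation: it only shows, in plain Python, how three loss terms (DSM, local attention, scene text recognition) might be combined into one objective and how one might restrict updates to cross-attention parameters, which the Figure 3 caption states are the only ones trained. The function names, the weights, and the `attn2` naming convention for cross-attention layers are assumptions made for illustration; the page does not state how the terms are weighted.

```python
def total_loss(dsm, local_attention, recognition, w_attn, w_ocr):
    """Weighted sum of the three training terms.

    The weights are hypothetical placeholders; the text above does not
    say how the terms are balanced.
    """
    return dsm + w_attn * local_attention + w_ocr * recognition


def trainable_names(param_names, marker="attn2"):
    """Keep only parameters whose name contains the cross-attention marker.

    Assumption: cross-attention layers are identifiable by the substring
    "attn2" in their parameter names.
    """
    return [name for name in param_names if marker in name]


# Hypothetical parameter names, used only to exercise the filter.
names = [
    "input_blocks.1.1.transformer_blocks.0.attn1.to_q.weight",
    "input_blocks.1.1.transformer_blocks.0.attn2.to_k.weight",
    "input_blocks.1.1.transformer_blocks.0.attn2.to_v.weight",
    "input_blocks.1.0.in_layers.2.weight",
]

selected = trainable_names(names)
loss = total_loss(0.5, 2.0, 4.0, w_attn=0.25, w_ocr=0.125)
print(selected)  # the two attn2 entries
print(loss)      # 1.5
```

In a real training loop the same name filter would typically decide which parameters receive gradients, leaving the rest of the pre-trained model frozen.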

Paper Structure

This paper contains 22 sections, 8 equations, 12 figures, 2 tables, and 1 algorithm.

Figures (12)

  • Figure 1: The proposed UDiffText is capable of synthesizing accurate and harmonious text in either synthetic or real-world images, and can thus be applied to tasks such as scene text editing (a), arbitrary text generation (b) and accurate T2I generation (c).
  • Figure 2: Text rendering problems of T2I models. The prompt we use is "A signboard near the highway that says 'Cyberpunk Night City'". Spelling errors are commonly seen in images generated by Stable Diffusion XL, DALL-E 3 and Midjourney.
  • Figure 3: An overview of the training process of our proposed UDiffText. We build our model based on the inpainting version of Stable Diffusion (v2.0). A character-level (CL) text encoder is utilized to obtain robust embeddings from the text to be rendered. We train the model using denoising score matching (DSM) together with the local attention loss calculated based on character-level segmentation maps and the auxiliary scene text recognition loss. Note that only the parameters of cross-attention (CA) blocks are updated during training.
  • Figure 4: The network architecture of our character-level text encoder. A codebook is employed to translate the character indices into a sequence of learnable embeddings. These embeddings are enhanced by position embeddings and then input into a transformer to generate the encoded output.
  • Figure 5: Qualitative results on the scene/document/poster text editing task. The first row consists of the original images, while the second row comprises the input images with binary masks applied to the text region. The specific word to be generated is indicated at the top of each column.
  • ...and 7 more figures
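
The first stage of the character-level text encoder described in the Figure 4 caption (character indices, codebook lookup, position embeddings) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the alphabet, the maximum length, the embedding size and the random initialization are all hypothetical, the embeddings here are fixed rather than learned, and the transformer that follows this stage is omitted.

```python
import random

# Hypothetical alphabet and sizes; the paper's vocabulary and dimensions may differ.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
MAX_LEN = 12
EMBED_DIM = 8
PAD_INDEX = len(ALPHABET)  # one extra codebook row reserved for padding


def char_indices(word, max_len=MAX_LEN):
    """Map each character to its ALPHABET index and pad to max_len.

    Characters outside ALPHABET raise ValueError in this sketch.
    """
    idx = [ALPHABET.index(c) for c in word.lower()[:max_len]]
    return idx + [PAD_INDEX] * (max_len - len(idx))


def make_table(rows, dim, seed):
    """Stand-in for a learnable embedding table (random, fixed values)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rows)]


CODEBOOK = make_table(len(ALPHABET) + 1, EMBED_DIM, seed=0)
POSITION = make_table(MAX_LEN, EMBED_DIM, seed=1)


def embed(word):
    """Return MAX_LEN vectors, each codebook[char] + position[slot]."""
    return [
        [c + p for c, p in zip(CODEBOOK[i], POSITION[slot])]
        for slot, i in enumerate(char_indices(word))
    ]


print(char_indices("Cab")[:4])  # [2, 0, 1, 36]
vectors = embed("night")
print(len(vectors), len(vectors[0]))  # 12 8
```

Because each slot adds its own position vector, the same character placed at two different positions yields two different inputs to the transformer, which is what lets the encoder represent character order.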