Layout Agnostic Scene Text Image Synthesis with Diffusion Models

Qilong Zhangli, Jindong Jiang, Di Liu, Licheng Yu, Xiaoliang Dai, Ankit Ramchandani, Guan Pang, Dimitris N. Metaxas, Praveen Krishnan

TL;DR

The novelty of SceneTextGen lies in its integration of three key components: a character-level encoder for capturing detailed typographic properties, coupled with a character-level instance segmentation model and a word-level spotting model to address the issues of unwanted text generation and minor character inaccuracies.

Abstract

While diffusion models have significantly advanced the quality of image generation, their capability to accurately and coherently render text within these images remains a substantial challenge. Conventional diffusion-based methods for scene text generation are typically limited by their reliance on an intermediate layout output. This dependency often results in a constrained diversity of text styles and fonts, an inherent limitation stemming from the deterministic nature of the layout generation phase. To address these challenges, this paper introduces SceneTextGen, a novel diffusion-based model specifically designed to circumvent the need for a predefined layout stage. By doing so, SceneTextGen facilitates a more natural and varied representation of text. The novelty of SceneTextGen lies in its integration of three key components: a character-level encoder for capturing detailed typographic properties, coupled with a character-level instance segmentation model and a word-level spotting model to address the issues of unwanted text generation and minor character inaccuracies. We validate the performance of our method by demonstrating improved character recognition rates on generated images across different public visual text datasets in comparison to both standard diffusion-based methods and text-specific methods.

Paper Structure

This paper contains 25 sections, 3 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Models reliant on predefined text layouts for input exhibit limitations such as constrained font diversity and static text positioning during each inference, leading to a lack of variability in style and arrangement.
  • Figure 2: Scene Text Generation. Qualitative samples of scene text images generated by our model are presented. These images contain visually appealing texts that are coherent with the background and are created without relying on any spatial information or predefined layouts as input, thereby enhancing the Text-to-Image (T2I) Diffusion Model's capability to generate text.
  • Figure 3: Model Framework: SceneTextGen employs a character-level encoder to extract detailed character-specific features. During loss computation, the model leverages both word-level and character-level supervision to guide the recovery of the image, in addition to the standard denoising loss. This dual-level supervision enhances the model's ability to accurately generate and refine text within scenes.
  • Figure 4: Comparative visualization of generated images. We present a side-by-side comparison of images generated from the same text prompt across different existing methods (with results generated by TextDiffuser as a proxy; human faces are blurred for ethical considerations). Each row corresponds to a unique prompt, showcasing the visual quality, text clarity, and contextual coherence achieved by each method.
  • Figure 5: t-SNE representation of text region embeddings derived from the penultimate layer features of a font recognition model.
  • ...and 3 more figures
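
The Figure 3 caption describes a training objective that adds word-level and character-level supervision to the standard denoising loss. One way to picture this is as a weighted sum of three scalar terms. The sketch below is illustrative only: the additive form, the weight names (`char`, `word`), and their default values are assumptions, not taken from the paper, and the three inputs stand in for loss values that the real model would compute from its denoiser, segmentation model, and spotting model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical weights; the source does not state any values.
    char: float = 1.0  # weight on the character-level segmentation term
    word: float = 1.0  # weight on the word-level spotting term


def total_loss(denoise: float, char_seg: float, word_spot: float,
               weights: LossWeights = LossWeights()) -> float:
    """Combine the three loss terms as a weighted sum (assumed form)."""
    return denoise + weights.char * char_seg + weights.word * word_spot


# Example with placeholder loss values (not results from the paper).
combined = total_loss(0.5, 0.25, 0.125)
down_weighted = total_loss(0.5, 0.25, 0.125, LossWeights(char=0.0, word=2.0))
```

Setting a weight to zero disables the corresponding supervision, which is how an ablation over the two auxiliary terms could be expressed under this assumed form.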