Visual Text Generation in the Wild

Yuanzhi Zhu, Jiawei Liu, Feiyu Gao, Wenyu Liu, Xinggang Wang, Peng Wang, Fei Huang, Cong Yao, Zhibo Yang

TL;DR

Extensive experiments demonstrate that the proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability, and provides superior utility for tasks involving text detection and text recognition.

Abstract

Recently, with the rapid advancements of generative models, the field of visual text generation has witnessed significant progress. However, it is still challenging to render high-quality text images in real-world scenarios, as three critical criteria should be satisfied: (1) Fidelity: the generated text images should be photo-realistic and the contents are expected to be the same as specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images can facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, either rendering-based or diffusion-based, can hardly meet all these aspects simultaneously, limiting their application range. Therefore, we propose in this paper a visual text generator (termed SceneVTG), which can produce high-quality text images in the wild. Following a two-stage paradigm, SceneVTG leverages a Multimodal Large Language Model to recommend reasonable text regions and contents across multiple scales and levels, which are used by a conditional diffusion model as conditions to generate text images. Extensive experiments demonstrate that the proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability. Besides, the generated images provide superior utility for tasks involving text detection and text recognition. Code and datasets are available at AdvancedLiterateMachinery.

Paper Structure (36 sections, 5 equations, 13 figures, 9 tables)

This paper contains 36 sections, 5 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Comparison of rendering-based methods, diffusion-based methods, and our method SceneVTG regarding fidelity, reasonability, and utility. Their advantages and limitations are indicated in green and red. Zoom in for better views.
  • Figure 2: Pipelines of rendering-based methods, diffusion-based methods, and SceneVTG.
  • Figure 3: The overall architecture of SceneVTG. Given a background image and a pre-defined text prompt, SceneVTG generates text regions and contents in two steps with an MLLM, then generates the visual text image with a local conditional diffusion model.
  • Figure 4: The detailed pipeline of the Local Visual Text Renderer. Given the TRCG outputs and background images, we construct image-level and embedding-level conditions to train the local conditional diffusion model.
  • Figure 5: Visualizations of end-to-end generation results compared with existing methods. SynthText and SceneVTG automatically render visual text on background images. TextDiffuser, GlyphControl, AnyText, and TextDiffuser-2 generate the entire images based on the same captions, text regions, and text contents. SceneVTG can generate more accurate characters that fit better into the regions. Zoom in for better views.
  • ...and 8 more figures
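
The two-stage data flow described in the abstract and in the Figure 3 caption can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: every name here (`TextRegion`, `generate_visual_text`, the `toy_*` stand-ins) is hypothetical, the box format is an assumption, and the "two steps" of the first stage are read as "regions first, then contents", which is also an assumption. The MLLM and the local conditional diffusion model are replaced by trivial placeholder functions so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical region format: (x0, y0, x1, y1) in character-cell units.
Box = Tuple[int, int, int, int]


@dataclass
class TextRegion:
    box: Box
    content: str


def generate_visual_text(
    background: List[str],
    prompt: str,
    propose_regions: Callable[[List[str], str], List[Box]],
    propose_contents: Callable[[List[str], str, List[Box]], List[str]],
    render_region: Callable[[List[str], TextRegion], List[str]],
):
    """Stage 1 proposes regions and contents; stage 2 renders each region."""
    # Stage 1, step 1: where could text plausibly go? (MLLM placeholder)
    boxes = propose_regions(background, prompt)
    # Stage 1, step 2: what should the text say? (MLLM placeholder)
    contents = propose_contents(background, prompt, boxes)
    if len(boxes) != len(contents):
        raise ValueError("each region needs exactly one content string")
    regions = [TextRegion(b, c) for b, c in zip(boxes, contents)]

    # Stage 2: render one region at a time (diffusion-model placeholder).
    image = background
    for region in regions:
        image = render_region(image, region)
    return image, regions


# --- Toy stand-ins so the sketch is runnable; none of this is a real model ---

def toy_regions(background: List[str], prompt: str) -> List[Box]:
    return [(0, 0, 5, 1)]  # one fixed box on the first row


def toy_contents(background: List[str], prompt: str, boxes: List[Box]) -> List[str]:
    return ["OPEN" for _ in boxes]


def toy_render(image: List[str], region: TextRegion) -> List[str]:
    # "Rendering" here just writes characters into a grid of strings,
    # clipped to the width of the box.
    x0, y0, x1, _ = region.box
    rows = [list(row) for row in image]
    for i, ch in enumerate(region.content[: x1 - x0]):
        rows[y0][x0 + i] = ch
    return ["".join(row) for row in rows]


if __name__ == "__main__":
    image, regions = generate_visual_text(
        ["......", "......"], "a shop sign", toy_regions, toy_contents, toy_render
    )
    print("\n".join(image))
```

Passing the two stages in as callables keeps the orchestration separate from the models, so a real region/content recommender or renderer could be swapped in without changing the control flow.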