First Creating Backgrounds Then Rendering Texts: A New Paradigm for Visual Text Blending

Zhenhang Li, Yan Shu, Weichao Zeng, Dongbao Yang, Yu Zhou

TL;DR

A background generator is developed to produce high-fidelity, text-free natural images, and a text renderer named GlyphOnly is designed to achieve visually plausible text-background integration.

Abstract

Diffusion models, known for their impressive image generation abilities, have played a pivotal role in the rise of visual text generation. Nevertheless, existing visual text generation methods often focus on generating entire images from text prompts, leading to imprecise control and limited practicality. A more promising direction is visual text blending, which focuses on seamlessly merging texts onto text-free backgrounds. However, existing visual text blending methods often struggle to generate high-fidelity and diverse images due to a shortage of backgrounds for synthesis and limited generalization capabilities. To overcome these challenges, we propose a new visual text blending paradigm that comprises both creating backgrounds and rendering texts. Specifically, a background generator is developed to produce high-fidelity and text-free natural images. Moreover, a text renderer named GlyphOnly is designed for achieving visually plausible text-background integration. GlyphOnly, built on a Stable Diffusion framework, utilizes glyphs and backgrounds as conditions for accurate rendering and consistency control, and is equipped with an adaptive text block exploration strategy for small-scale text rendering. We also explore several downstream applications based on our method, including scene text dataset synthesis for boosting scene text detectors, as well as text image customization and editing. Code and model will be available at https://github.com/Zhenhang-Li/GlyphOnly.

Paper Structure

This paper contains 35 sections, 5 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Visual Texts generated by (a) Existing Visual Text Generation Methods, and (b) Our Visual Blending Method.
  • Figure 2: The framework of the proposed method. The first stage, creating backgrounds, involves synthesis, erasing and evaluation. In the second stage, rendering texts, GlyphOnly integrates noisy features, segmentation masks, feature masks, and masked features as inputs to the U-Net. The frozen pre-trained CLIP Image Encoder converts glyph images and background images into embeddings for generation control. During training, only the parameters of the convolutional layers of the U-Net input, the convolutional layers of the conditional input, and the key and value components of the U-Net cross-attention layers are updated. Note that the diffusion model performs denoising in the latent space; image pixels are shown here only for better visualization.
  • Figure 3: Visualization comparison between our approach and existing methods.
  • Figure 4: Qualitative comparison results. We compare our method with the SOTA direct generation approach.
  • Figure 5: Visualization of tiny-size text generation.
  • ...and 1 more figure