How Control Information Influences Multilingual Text Image Generation and Editing?

Boqiang Zhang, Zuan Gao, Yadong Qu, Hongtao Xie

TL;DR

This work investigates how control information governs multilingual visual text generation in diffusion models, showing that input encoding, denoising-stage roles, and the frequency content of output features together shape text quality. It introduces TextGen, a two-stage global-to-detail framework augmented with Fourier Enhancement Convolution (FEC) and a frequency-balancing mechanism that modulates control signals within the U-Net, supporting both generation and editing in one model. A lightweight dataset, TG-2M, is built to train the model efficiently, enabling strong performance on both English and Chinese text generation with less data than prior methods. The approach advances practical multilingual scene-text generation and editing, offering a unified inference paradigm and design principles for future control-information-driven generation systems.

Abstract

Visual text generation has significantly advanced through diffusion models aimed at producing images with readable and realistic text. Recent works primarily use a ControlNet-based framework, employing standard font text images to control diffusion models. Recognizing the critical role of control information in generating high-quality text, we investigate its influence from three perspectives: input encoding, role at different stages, and output features. Our findings reveal that: 1) Input control information has unique characteristics compared to conventional inputs like Canny edges and depth maps. 2) Control information plays distinct roles at different stages of the denoising process. 3) Output control features significantly differ from the base and skip features of the U-Net decoder in the frequency domain. Based on these insights, we propose TextGen, a novel framework designed to enhance generation quality by optimizing control information. We improve input and output features using Fourier analysis to emphasize relevant information and reduce noise. Additionally, we employ a two-stage generation framework to align the different roles of control information at different stages. Furthermore, we introduce an effective and lightweight dataset for training. Our method achieves state-of-the-art performance in both Chinese and English text generation. The code and dataset are available at https://github.com/CyrilSterling/TextGen.
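
To make the idea of stage-dependent control more concrete, the following is a minimal sketch, not the TextGen implementation. It assumes two precomputed control feature maps, one for a global stage and one for a detail stage, and switches between them at a hypothetical split point before adding the chosen one to a decoder skip feature in the ControlNet style. The names `select_control`, `add_control`, `split`, and `scale` are illustrative assumptions and do not come from the paper or its code.

```python
import numpy as np

def select_control(step, num_steps, global_ctrl, detail_ctrl, split=0.5):
    """Pick the control feature for one denoising step.

    Steps run from 0 (noisiest) to num_steps - 1 (cleanest).
    `split` is a hypothetical fraction, not a value from the paper.
    """
    if not 0 <= step < num_steps:
        raise ValueError("step out of range")
    # Early steps use the global control, later steps the detail control.
    return global_ctrl if step < split * num_steps else detail_ctrl

def add_control(skip_feat, ctrl_feat, scale=1.0):
    """ControlNet-style injection: add control features to a skip feature."""
    return skip_feat + scale * ctrl_feat

# Toy feature maps of shape (channels, height, width).
rng = np.random.default_rng(0)
skip = rng.standard_normal((4, 8, 8))
global_ctrl = np.full((4, 8, 8), 1.0)
detail_ctrl = np.full((4, 8, 8), 2.0)

early = add_control(skip, select_control(0, 10, global_ctrl, detail_ctrl))
late = add_control(skip, select_control(9, 10, global_ctrl, detail_ctrl))
```

In this toy setup the early step receives the global control and the late step receives the detail control. A real model would produce both control features with learned networks rather than constants.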

Paper Structure (28 sections, 2 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: The overall pipeline of recent text generation works. It utilizes a ControlNet for guiding the text generation process, employing a glyph image with a standard font as the control information. Control information at different stages is generated in the same manner and directly added to the skip features in the U-Net decoder.
  • Figure 2: Differences between text control information and general ControlNet control information, including anime line drawings, M-LSD lines, and Canny edges. General controls focus on the overall structure and tolerate localized errors, while text control requires precise detail.
  • Figure 3: Visualization of the impact of control at different denoising stages. Control information is removed in the gray segments of the color bar during denoising. (a) Because visual text generation requires fine texture detail, control information in later stages still plays an important role. (b) Even with only glyph and position images as control information, early-stage control influences non-text regions, ensuring the text region is coherent and matches the background.
  • Figure 4: The TextGen pipeline. It comprises two stages: the global control stage and the detail control stage. Control information passes through a Fourier Enhancement Convolution (FEC) block and a Spatial Convolution (SC) block to extract features. During inference, we introduce a novel denoising paradigm that unifies generation and editing based on our framework design. Best viewed in color.
  • Figure 5: The relative log amplitude of the three groups of features in the U-Net decoder.
  • ...and 7 more figures
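
A curve like the one described for Figure 5 can be produced with a radially averaged Fourier spectrum. The sketch below shows one common way to compute a relative log amplitude profile for a single 2D feature channel. It is an assumption about the general technique, not the exact procedure used in the paper, and the function name `relative_log_amplitude` and the parameter `n_bins` are illustrative.

```python
import numpy as np

def relative_log_amplitude(feat, n_bins=8):
    """Radially averaged log amplitude, relative to the lowest-frequency bin.

    feat: 2D array (H, W), e.g. one channel of a feature map.
    Returns an array of length n_bins whose first entry is 0.
    """
    h, w = feat.shape
    amplitude = np.abs(np.fft.fftshift(np.fft.fft2(feat)))
    # Normalized frequency (cycles per sample) of every FFT coefficient.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    # Bin by radius; anything beyond 0.5 is folded into the last bin.
    edges = np.linspace(0.0, 0.5, n_bins + 1)
    bin_idx = np.clip(np.digitize(radius, edges) - 1, 0, n_bins - 1)
    profile = np.array([amplitude[bin_idx == b].mean() for b in range(n_bins)])
    log_profile = np.log(profile + 1e-8)
    return log_profile - log_profile[0]

# Toy comparison: a smooth pattern versus the same pattern with added noise.
x = np.linspace(0.0, 2.0 * np.pi, 32, endpoint=False)
smooth = np.outer(np.sin(x), np.sin(x))
noisy = smooth + 0.5 * np.random.default_rng(0).standard_normal((32, 32))

smooth_profile = relative_log_amplitude(smooth)
noisy_profile = relative_log_amplitude(noisy)
```

The noisy map keeps far more energy in the highest-frequency bin than the smooth one, which is the kind of difference such a plot is meant to expose when comparing feature groups.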