TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control

Weichao Zeng, Yan Shu, Zhenhang Li, Dongbao Yang, Yu Zhou

TL;DR

This work proposes TextCtrl, a diffusion-based scene text editing method with prior guidance control. It explicitly incorporates Style-Structure guidance into model design and network training, which significantly improves text style consistency and rendering accuracy.

Abstract

Centred on content modification and style preservation, Scene Text Editing (STE) remains a challenging task despite considerable recent progress in text-to-image synthesis and text-driven image manipulation. GAN-based STE methods generally generalize poorly, while diffusion-based STE methods suffer from undesired style deviations. To address these problems, we propose TextCtrl, a diffusion-based method that edits text with prior guidance control. Our method consists of two key components: (i) by constructing fine-grained text style disentanglement and a robust text glyph structure representation, TextCtrl explicitly incorporates Style-Structure guidance into model design and network training, significantly improving text style consistency and rendering accuracy; (ii) to further leverage the style prior, a Glyph-adaptive Mutual Self-attention mechanism is proposed, which deconstructs the implicit fine-grained features of the source image to enhance style consistency and visual quality during inference. Furthermore, to address the lack of a real-world STE evaluation benchmark, we create the first real-world image-pair dataset, termed ScenePair, for fair comparisons. Experiments demonstrate the effectiveness of TextCtrl compared with previous methods in terms of both style fidelity and text accuracy.
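To make the inference-time idea in (ii) more concrete, the following is a minimal sketch of a generic mutual self-attention step, in which queries from the editing branch attend to keys and values drawn partly from the source-image branch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the linear blending rule, and the scalar weight `lam` (standing in for a glyph-dependent weighting) are all hypothetical, and the abstract does not specify how the actual mechanism computes them.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    # Plain scaled dot-product attention on lists of row vectors.
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v))
                    for c in range(len(v[0]))])
    return out

def blend(a, b, lam):
    # Element-wise lam * a + (1 - lam) * b (hypothetical blending rule).
    return [[lam * x + (1 - lam) * y for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]

def mutual_self_attention(q_edit, k_edit, v_edit, k_src, v_src, lam):
    # Assumption: lam in [0, 1] controls how much source-branch style
    # information is injected; lam = 0 reduces to ordinary self-attention.
    k = blend(k_src, k_edit, lam)
    v = blend(v_src, v_edit, lam)
    return attention(q_edit, k, v)
```

With `lam = 0.0` the function reduces to ordinary self-attention on the editing branch, and with `lam = 1.0` the editing queries read only from the source features, which is the regime where source style is carried over most strongly in this sketch.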

Paper Structure

This paper contains 46 sections, 15 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: Conceptual illustration of the decomposition of STE by TextCtrl. (a) Text style is disentangled into text background, text foreground, text font glyph and text color features. (b) Text glyph structure is represented by the cluster centroid of various font text features. (c) The explicit style features and structure features guide the generator to perform scene text editing.
  • Figure 2: Decomposed framework of TextCtrl. (a) Text glyph structure encoder $\mathcal{T}$ with corresponding glyph structure representation pre-training. (b) Text style encoder $\mathcal{S}$ with corresponding style disentanglement pre-training. (c) Prior guided diffusion generator $\mathcal{G}$. (d) The improved inference control with the Glyph-adaptive Mutual Self-attention mechanism.
  • Figure 3: Qualitative comparison among different methods. Note that for the inpainting-based methods [Ji2023ImprovingDM, Chen2023TextDiffuserDM, Tuo2023AnyTextMV], we conduct the editing on the full-size images and visualize the edited text region.
  • Figure 4: Qualitative comparison with inpainting-based methods [Ji2023ImprovingDM, Chen2023TextDiffuserDM, Tuo2023AnyTextMV] on full-size images.
  • Figure 5: t-SNE [Maaten2008VisualizingDU] visualization of style features by pre-trained text style encoder.
  • ...and 6 more figures