Global-Local Aware Scene Text Editing

Fuxiang Yang, Tonghua Su, Donglin Di, Yin Chen, Xiangqian Wu, Zhongjie Wang, Lei Fan

TL;DR

GLASTE addresses two problems in scene text editing: inconsistency between the edited patch and its surrounding context, and poor handling of target text whose length differs from the source. It introduces a global-local framework with an inpainting module, a foreground synthesis module, and an affine fusion module to produce coherent, readable edits that adapt to the target text length. A size-independent style vector, Rotated RoIAlign, AdaIN-based synthesis, and an affine transformation enable flexible rendering while preserving global image harmony; joint global-local losses with PatchGAN discriminators optimize both background fidelity and local text quality. Experimental results on real and synthetic data demonstrate state-of-the-art performance and robust recognition; diffusion models are noted as a future direction due to computational trade-offs and text-generation challenges.
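To make the idea of a size-independent style vector concrete, here is a minimal NumPy sketch of adaptive instance normalization (AdaIN). It is not the authors' implementation: the function name `adain`, the split of the style vector into a per-channel mean and standard deviation, and the feature shapes are all assumptions chosen for illustration.

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Apply AdaIN to a feature map (illustrative sketch only).

    content:    array of shape (C, H, W), features of the target text image
    style_mean: array of shape (C,), assumed per-channel style mean
    style_std:  array of shape (C,), assumed per-channel style std
    """
    # Per-channel statistics of the content features over the spatial axes.
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True)
    # Remove the content statistics, then impose the style statistics.
    normalized = (content - mu) / (sigma + eps)
    return normalized * style_std[:, None, None] + style_mean[:, None, None]

# The same style statistics can be applied to feature maps of any width,
# which is what makes the style representation independent of image size.
rng = np.random.default_rng(0)
style_mean = np.array([0.5, -1.0, 2.0])
style_std = np.array([1.0, 0.5, 2.0])
short_text = adain(rng.normal(size=(3, 8, 20)), style_mean, style_std)
long_text = adain(rng.normal(size=(3, 8, 50)), style_mean, style_std)
```

Because the style is carried by per-channel statistics rather than by a spatial feature map, nothing in the operation depends on the height or width of the content features.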

Abstract

Scene Text Editing (STE) involves replacing text in a scene image with new target text while preserving both the original text style and background texture. Existing methods suffer from two major challenges: inconsistency and length-insensitivity. They often fail to maintain coherence between the edited local patch and the surrounding area, and they struggle to handle significant differences in text length before and after editing. To tackle these challenges, we propose an end-to-end framework called Global-Local Aware Scene Text Editing (GLASTE), which simultaneously incorporates high-level global contextual information along with delicate local features. Specifically, we design a global-local combination structure, joint global and local losses, and enhance text image features to ensure consistency in text style within local patches while maintaining harmony between local and global areas. Additionally, we express the text style as a vector independent of the image size, which can be transferred to target text images of various sizes. We use an affine fusion to fill target text images into the editing patch while maintaining their aspect ratio unchanged. Extensive experiments on real-world datasets validate that our GLASTE model outperforms previous methods in both quantitative metrics and qualitative results and effectively mitigates the two challenges.
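The aspect-ratio-preserving placement described in the abstract can be illustrated with a small sketch. The helper `fit_affine` below is hypothetical and assumes an axis-aligned editing box; the paper's affine fusion module may parameterize the transform differently (for example, to handle rotated regions).

```python
import numpy as np

def fit_affine(src_w, src_h, box_x, box_y, box_w, box_h):
    """Return a 2x3 affine matrix that places a (src_w x src_h) text image
    inside an axis-aligned box without changing its aspect ratio.
    Illustrative sketch only; names and box format are assumptions."""
    # A single uniform scale keeps the aspect ratio unchanged.
    s = min(box_w / src_w, box_h / src_h)
    # Center the scaled image inside the box.
    tx = box_x + (box_w - s * src_w) / 2
    ty = box_y + (box_h - s * src_h) / 2
    return np.array([[s, 0.0, tx],
                     [0.0, s, ty]])

def apply_affine(matrix, x, y):
    """Map a source point (x, y) to image coordinates."""
    return matrix @ np.array([x, y, 1.0])

# A long target text (200x50) placed into a 100x40 box at (10, 20).
M = fit_affine(200, 50, 10, 20, 100, 40)
top_left = apply_affine(M, 0, 0)
bottom_right = apply_affine(M, 200, 50)
```

With a single scale factor for both axes, a longer target text shrinks uniformly to fit the box width instead of being squeezed horizontally, which is the behavior the abstract describes.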

Paper Structure

This paper contains 13 sections, 10 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: (a) Previous STE methods run the generation model on cropped text patches and rely on a "crop-and-paste" operation. (b) Our GLASTE method takes the entire image as input, inpaints the specified text region, and then renders the target text within that area to generate the scene image.
  • Figure 2: The overall structure of GLASTE. The network consists of an inpainting module, a foreground module, and an affine fusion module. The foreground module includes a style encoder, a content encoder, and a text synthesizer.
  • Figure 3: Examples of scene text editing results of our GLASTE.
  • Figure 4: Comparison of previous methods and our GLASTE.
  • Figure 5: (a) Editing from a fixed source text to target texts of variable length. (b) Using synthetic paired data to alleviate overfitting. "w/o pd" means "w/o paired data".