Recognition-Synergistic Scene Text Editing

Zhengyao Fang, Pengyuan Lyu, Jingjing Wu, Chengquan Zhang, Jun Yu, Guangming Lu, Wenjie Pei

TL;DR

RS-STE tackles the problem of editing scene text in images while preserving the original style by unifying text recognition and editing in a single Transformer-based framework. It leverages a Multi-modal Parallel Decoder to jointly predict the target text and the edited image from a reference style image, enabling implicit style-content disentanglement through recognition. A Cyclic Self-Supervised Fine-tuning pipeline trains on unpaired real data with cycle-consistent supervision and recognition losses, substantially improving real-world generalization. Empirical results on synthetic and real benchmarks show state-of-the-art editing quality and notable boosts to downstream recognition when using generated hard-case edits for data augmentation.

Abstract

Scene text editing aims to modify text content within scene images while maintaining style consistency. Traditional methods achieve this by explicitly disentangling style and content from the source image and then fusing the style with the target content, while ensuring content consistency using a pre-trained recognition model. Despite notable progress, these methods suffer from complex pipelines, leading to suboptimal performance in complex scenarios. In this work, we introduce Recognition-Synergistic Scene Text Editing (RS-STE), a novel approach that fully exploits the intrinsic synergy of text recognition for editing. Our model seamlessly integrates text recognition with text editing within a unified framework, and leverages the recognition model's ability to implicitly disentangle style and content while ensuring content consistency. Specifically, our approach employs a multi-modal parallel decoder based on transformer architecture, which predicts both text content and stylized images in parallel. Additionally, our cyclic self-supervised fine-tuning strategy enables effective training on unpaired real-world data without ground truth, enhancing style and content consistency through a twice-cyclic generation process. Built on a relatively simple architecture, RS-STE achieves state-of-the-art performance on both synthetic and real-world benchmarks, and further demonstrates the effectiveness of leveraging the generated hard cases to boost the performance of downstream recognition tasks. Code is available at https://github.com/ZhengyaoFang/RS-STE.
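
The twice-cyclic generation process mentioned above can be illustrated with a toy data-flow sketch. This is not the authors' implementation: `ToyImage`, `toy_model`, `char_mismatch`, and `cyclic_step` are hypothetical names, the "image" is a symbolic stand-in rather than pixels, and feeding the predicted source text into the second pass is an assumption made for illustration. The only property taken from the abstract is that a single model outputs both recognized text and an edited image, and that it is applied twice so that the original image serves as its own supervision.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ToyImage:
    """Symbolic stand-in for a text image: a style tag plus the rendered string."""
    style: str
    text: str


# Hypothetical interface: (reference image, target text) -> (recognized text, edited image)
Model = Callable[[ToyImage, str], Tuple[str, ToyImage]]


def toy_model(image: ToyImage, target_text: str) -> Tuple[str, ToyImage]:
    """Idealized model: reads the text in the image and re-renders the target text in the same style."""
    recognized = image.text
    edited = ToyImage(style=image.style, text=target_text)
    return recognized, edited


def char_mismatch(a: str, b: str) -> int:
    """Toy stand-in for a recognition loss: differing positions plus length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def cyclic_step(model: Model, real_image: ToyImage, target_text: str) -> Dict[str, object]:
    """Run the model twice: edit to the target text, then edit back (assumed formulation)."""
    # Pass 1: edit the real image; the model also predicts the source text.
    source_text_pred, edited = model(real_image, target_text)
    # Pass 2: edit back using the predicted source text; the model should now read the target text.
    target_text_pred, reconstructed = model(edited, source_text_pred)
    return {
        # Cycle consistency: the reconstruction should match the original real image.
        "cycle": 0.0 if reconstructed == real_image else 1.0,
        # Recognition consistency: the edited image should actually contain the target text.
        "recognition": float(char_mismatch(target_text_pred, target_text)),
        "reconstructed": reconstructed,
    }


if __name__ == "__main__":
    print(cyclic_step(toy_model, ToyImage("neon", "OPEN"), "SALE"))
```

In this toy setting, a degenerate model that returns its input image unchanged still achieves zero cycle loss, but its recognition term is nonzero. That suggests why a recognition signal is useful alongside cycle consistency when no paired ground truth is available.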

Paper Structure

This paper contains 25 sections, 6 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Prior methods for scene text editing involve intricate modeling for explicit separation of text content and background style. In contrast, our RS-STE conducts synergistic modeling of scene text recognition and text editing in a unified framework, which allows for implicit text-style separation while ensuring content consistency. Besides, the specially designed Cyclic Self-supervised Fine-tuning enables effective training of RS-STE on unpaired real-world data, substantially enhancing the generalizability in real-world scenarios.
  • Figure 2: Distribution of some content features extracted by our RS-STE. Images with the same text content but different background styles become closer in the encoded feature space of a recognition model, implying the capability of recognition models to separate style from content.
  • Figure 3: (a) illustrates the model structure of RS-STE and the fully-supervised pre-training stage using paired synthetic datasets. (b) depicts the cyclic self-supervised fine-tuning stage with unpaired real-world datasets.
  • Figure 4: Editing examples compared with other methods.
  • Figure 5: Visualization examples of ablation study.
  • ...and 4 more figures