On Manipulating Scene Text in the Wild with Diffusion Models

Joshua Santoso, Christian Simon, Williem

TL;DR

This paper tackles the challenge of editing scene text in wild images while preserving the surrounding visual details. It introduces DBEST, a diffusion-based framework that uses two main strategies: one-shot style adaptation to retain source appearance, and text recognition guidance to ensure readable target text. By pretraining on a SynText synthetic dataset and fine-tuning in two optimization stages, DBEST achieves state-of-the-art OCR and image-quality performance on synthetic and in-the-wild benchmarks. The work demonstrates practical potential for real-world text editing tasks, such as sign translation and privacy-preserving edits, while noting limitations in inference speed and long-text handling that warrant future work.

Abstract

Diffusion models have gained attention for image editing, yielding impressive results in text-to-image tasks. On the downside, images generated by stable diffusion models suffer from deteriorated details. This pitfall affects image editing tasks that require information preservation, e.g., scene text editing. In this task, the model must replace the text in the source image with the target text while preserving details such as color, font size, and background. To leverage the potential of diffusion models, in this work we introduce the Diffusion-BasEd Scene Text manipulation Network (DBEST). Specifically, we design two adaptation strategies, namely one-shot style adaptation and text-recognition guidance. In experiments, we thoroughly assess and compare our proposed method against state-of-the-art methods on various scene text datasets, then provide extensive ablation studies at each level of granularity to analyze our performance gain. We also demonstrate the effectiveness of our proposed method in synthesizing scene text, as indicated by competitive Optical Character Recognition (OCR) accuracy. Our method achieves 94.15% and 98.12% in character-level evaluation on the COCO-Text and ICDAR2013 datasets, respectively.
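The abstract reports character-level OCR results but does not define the metric here. As a minimal sketch, assuming one common definition (1 minus the edit distance between the recognized string and the target string, normalized by the longer length), the score could be computed as follows. The function names `edit_distance`, `char_accuracy`, and `mean_char_accuracy` are hypothetical, and the paper's exact protocol may differ.

```python
from typing import Iterable, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(predicted: str, target: str) -> float:
    """Assumed metric: 1 - normalized edit distance (not taken from the paper)."""
    if not predicted and not target:
        return 1.0
    longest = max(len(predicted), len(target))
    return 1.0 - edit_distance(predicted, target) / longest


def mean_char_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average the per-word score over (OCR output, target text) pairs."""
    scores = [char_accuracy(p, t) for p, t in pairs]
    return sum(scores) / len(scores)
```

In use, each edited image would be passed through an OCR model and the recognized string paired with the requested target text; a word with one wrong character out of five scores 0.8 under this definition.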

Paper Structure

This paper contains 32 sections, 10 equations, 14 figures, 5 tables, 1 algorithm.

Figures (14)

  • Figure 1: Top: Comparison between state-of-the-art methods and our method on scene text manipulation, given an input image and target text. Bottom: Comparison between the baseline text-to-image Latent Diffusion Model (LDM) RombachCVPR2022 (blue box) and our method (red box) on the scene text domain, given random noise and a text condition as input.
  • Figure 2: The pipeline of our proposed method. The process is divided into two steps: one-shot style adaptation for fine-tuning the diffusion model, and text recognition guidance for optimizing the target embedding.
  • Figure 3: Qualitative comparison on COCO-Text VeitArXiv2016 and ICDAR2013 KaratzasICDAR2013 datasets. DBEST (ours) achieves superior qualitative results compared to SRNet WuACMM2019, STEFANN RoyCVPR2020, SDEdit MengICLR2022, Imagic KawarArXiv2022, Null-Inv RonArXiv2022+p2p HertzArXiv2022, Text2Img LDM RombachCVPR2022, Text2Live BartalECCV2022.
  • Figure 4: Results of scene text manipulation on ICDAR2015 KaratzasICDAR2015 (row 1), IIIT5K MishraCVPR2012 (row 2), and SVT PhanICCV2013 (row 3). The input image is marked with a green box and the edited version with a red box.
  • Figure 5: Given a single word (green box) from in-the-wild images and the desired text, our method successfully replaces the word with the desired text in the image (red box).
  • ...and 9 more figures
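
The two steps named in the Figure 2 caption differ mainly in which parameters they update: the first fine-tunes the model on a single source image, the second keeps the model fixed and optimizes the target embedding under a recognition objective. The toy sketch below illustrates only that split, in the order the caption lists the steps. Everything in it is an assumption made for illustration: a linear map `W` stands in for the diffusion model, a fixed linear read-out `R` stands in for the text recognizer, squared error replaces the paper's losses, and the names `adapt_style` and `guide_embedding` are hypothetical.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sq_err(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def adapt_style(W, e_src, x_src, lr=0.1, steps=200):
    """Step 1 (toy): fine-tune the model weights W on ONE source sample."""
    W = [row[:] for row in W]  # copy so the pretrained weights stay intact
    for _ in range(steps):
        residual = [p - x for p, x in zip(matvec(W, e_src), x_src)]
        for i, r in enumerate(residual):
            for j, e in enumerate(e_src):
                W[i][j] -= lr * 2.0 * r * e  # gradient of ||W e - x||^2 w.r.t. W
    return W


def guide_embedding(W, R, e_init, y_tgt, lr=0.1, steps=200):
    """Step 2 (toy): W and R stay frozen; only the embedding is updated."""
    e = e_init[:]
    for _ in range(steps):
        x = matvec(W, e)
        residual = [p - y for p, y in zip(matvec(R, x), y_tgt)]
        # Chain rule: through the recognizer R first, then through the model W.
        grad_x = [2.0 * sum(R[k][i] * residual[k] for k in range(len(R)))
                  for i in range(len(x))]
        grad_e = [sum(W[i][j] * grad_x[i] for i in range(len(W)))
                  for j in range(len(e))]
        e = [ej - lr * g for ej, g in zip(e, grad_e)]
    return e


# Made-up numbers, chosen only so the toy problem is well conditioned.
W_pretrained = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # stand-in "model"
R_frozen = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]        # stand-in "recognizer"
e_source = [1.0, 0.5]                                # source embedding
x_source = [0.8, 0.2, 0.9]                           # the single source sample
y_target = [0.0, 1.0]                                # desired recognizer output

W_adapted = adapt_style(W_pretrained, e_source, x_source)
e_target = guide_embedding(W_adapted, R_frozen, e_source, y_target)
```

After the first call, `W_adapted` reproduces `x_source` from `e_source`; after the second, the recognizer read-out of `matvec(W_adapted, e_target)` approaches `y_target` while `W_adapted` is left unchanged.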