ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining

Dezhi Peng, Chongyu Liu, Yuliang Liu, Lianwen Jin

TL;DR

ViTEraser introduces a Vision Transformer–based one-stage scene text removal framework that inherently couples text localization and inpainting within a single encoder–decoder. A novel SegMIM pretraining scheme reinforces the model by training the encoder on text box segmentation and the decoder on masked image modeling using large-scale scene text data, yielding strong global reasoning and plausible background reconstruction. Empirical results on SCUT-EnsText and SCUT-Syn demonstrate state-of-the-art STR performance, while extensions to Tampered-IC13 show robust generalization to related tasks. The work provides comprehensive insights into ViT-based STR design, pretraining, and scalability, offering a scalable path for pixel-level OCR-related reasoning.

Abstract

Scene text removal (STR) aims at replacing text strokes in natural scenes with visually coherent backgrounds. Recent STR approaches rely on iterative refinements or explicit text masks, resulting in high complexity and sensitivity to the accuracy of text localization. Moreover, most existing STR methods adopt convolutional architectures while the potential of vision Transformers (ViTs) remains largely unexplored. In this paper, we propose a simple-yet-effective ViT-based text eraser, dubbed ViTEraser. Following a concise encoder-decoder framework, ViTEraser can easily incorporate various ViTs to enhance long-range modeling. Specifically, the encoder hierarchically maps the input image into the hidden space through ViT blocks and patch embedding layers, while the decoder gradually upsamples the hidden features to the text-erased image with ViT blocks and patch splitting layers. As ViTEraser implicitly integrates text localization and inpainting, we propose a novel end-to-end pretraining method, termed SegMIM, which focuses the encoder and decoder on the text box segmentation and masked image modeling tasks, respectively. Experimental results demonstrate that ViTEraser with SegMIM achieves state-of-the-art performance on STR by a substantial margin and exhibits strong generalization ability when extended to other tasks, e.g., tampered scene text detection. Furthermore, we comprehensively explore the architecture, pretraining, and scalability of the ViT-based encoder-decoder for STR, which provides deep insights into the application of ViT to the STR field. Code is available at https://github.com/shannanyinxiang/ViTEraser.
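
The hierarchical encoder-decoder described in the abstract can be pictured through its feature-map shapes alone. The sketch below is not the authors' implementation: it omits the ViT blocks entirely and only tracks how patch embedding (downsampling) and patch splitting (upsampling) layers might change resolution. The function names, patch sizes, and channel widths are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def patch_embed(shape: Shape, patch: int, out_channels: int) -> Shape:
    """Merge each patch x patch block of tokens into one token (downsampling)."""
    height, width, _ = shape
    if height % patch or width % patch:
        raise ValueError("spatial size must be divisible by the patch size")
    return (height // patch, width // patch, out_channels)


def patch_split(shape: Shape, patch: int, out_channels: int) -> Shape:
    """Split each token into a patch x patch block of tokens (upsampling)."""
    height, width, _ = shape
    return (height * patch, width * patch, out_channels)


def trace_shapes(
    image: Shape,
    patches: Sequence[int],
    enc_channels: Sequence[int],
    dec_channels: Sequence[int],
) -> List[Shape]:
    """List the feature shape after every encoder and decoder stage."""
    shapes = [image]
    # Encoder: one patch embedding layer per stage (ViT blocks omitted).
    for patch, channels in zip(patches, enc_channels):
        shapes.append(patch_embed(shapes[-1], patch, channels))
    # Decoder: mirror the encoder with patch splitting layers.
    for patch, channels in zip(reversed(patches), dec_channels):
        shapes.append(patch_split(shapes[-1], patch, channels))
    return shapes


if __name__ == "__main__":
    # Hypothetical configuration: three stages, 512x512 RGB input.
    for stage_shape in trace_shapes(
        (512, 512, 3), (4, 2, 2), (64, 128, 256), (128, 64, 3)
    ):
        print(stage_shape)
```

Under these assumed settings the decoder mirrors the encoder, so the last shape returns to the input resolution with three output channels, which is what a text-erased image of the same size requires.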

Paper Structure (48 sections, 10 equations, 11 figures, 9 tables)

This paper contains 48 sections, 10 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Comparison of ViTEraser with existing STR paradigms. Our method revisits the conventional single-step one-stage framework and improves it with ViTs for feature modeling and the proposed SegMIM pretraining. Dashed arrows indicate cutting off gradient flow. (Loc.: Localization)
  • Figure 2: Overall architecture of ViTEraser. The ViTEraser follows the one-stage paradigm but is thoroughly equipped with ViTs, yielding a simple-yet-effective STR approach that is free of progressive refinements and text localizing processes.
  • Figure 3: Auxiliary outputs of ViTEraser during training, including (a) text box segmentation map and (b) multi-scale text erasing results. (TE: text erasing)
  • Figure 4: Pipeline of the proposed SegMIM pretraining. Given a randomly masked image, the text box segmentation and masked image modeling tasks are accomplished on top of the encoder and decoder, respectively.
  • Figure 5: Visualizations of SegMIM. (Pred.: Predicted)
  • ...and 6 more figures
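
As a rough illustration of the SegMIM inputs and targets named in the Figure 4 caption, the sketch below builds a random patch-level mask and a binary text box segmentation map. It is a minimal sketch under stated assumptions, not the authors' pipeline: the patch size, mask ratio, axis-aligned box format, and both function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Hypothetical box format: (x0, y0, x1, y1) with exclusive upper bounds.
Box = Tuple[int, int, int, int]


def random_patch_mask(
    height: int, width: int, patch: int, ratio: float, seed: Optional[int] = None
) -> List[List[int]]:
    """Return a patch-grid mask where 1 marks a patch hidden from the model."""
    rows, cols = height // patch, width // patch
    total = rows * cols
    rng = random.Random(seed)
    # Sample the masked patch indices without replacement.
    masked = set(rng.sample(range(total), round(total * ratio)))
    return [
        [1 if r * cols + c in masked else 0 for c in range(cols)]
        for r in range(rows)
    ]


def text_box_map(height: int, width: int, boxes: Sequence[Box]) -> List[List[int]]:
    """Return a pixel-level map where 1 marks pixels inside any text box."""
    seg = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the image bounds before filling it.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                seg[y][x] = 1
    return seg


if __name__ == "__main__":
    mask = random_patch_mask(32, 32, patch=8, ratio=0.5, seed=0)
    seg = text_box_map(8, 8, [(2, 1, 5, 3)])
    print(sum(map(sum, mask)), "masked patches;", sum(map(sum, seg)), "text pixels")
```

In a setup like the one the caption describes, the mask would decide which image patches are hidden before encoding, the box map would supervise a segmentation output on the encoder side, and the original pixels of the masked patches would supervise the reconstruction on the decoder side.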