Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling

Zixiao Wang, Hongtao Xie, YuXin Wang, Yadong Qu, Fengjun Guo, Pengwei Liu

TL;DR

TMIM is a Text-aware Masked Image Modeling algorithm that pretrains scene text removal (STR) models with low-cost text detection labels (e.g., text bounding boxes), greatly reducing the dependence on high-cost STR labels.

Abstract

The scene text removal (STR) task suffers from insufficient training data because pixel-level labeling is expensive. In this paper, we address this issue by introducing a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels (e.g., text bounding boxes). Unlike previous pretraining methods, which use indirect auxiliary tasks only to enhance implicit feature extraction, TMIM is the first to enable the STR task to be trained directly in a weakly supervised manner, exploring STR knowledge explicitly and efficiently. In TMIM, a Background Modeling stream first learns background generation rules by recovering the masked non-text regions; it also provides pseudo STR labels on the masked text regions. Second, a Text Erasing stream learns from these pseudo labels and equips the model with end-to-end STR ability. Benefiting from the two collaborative streams, our STR model achieves impressive performance using only public text detection datasets, which greatly alleviates the limitation of high-cost STR labels. Experiments demonstrate that our method outperforms other pretraining methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText). Code will be available at https://github.com/wzx99/TMIM.

Paper Structure

This paper contains 14 sections, 9 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: (a) Comparison of the annotations required by the scene text removal (STR) task and the scene text detection (STD) task. STR requires fine-grained manual modification, while STD requires only box-level text detection labels, which can be obtained with advanced OCR systems. (b-c) Previous fully supervised [liu2022don] and pretraining [peng2023viteraser] methods. (d) Our TMIM.
  • Figure 2: The overall framework of our TMIM, which consists of two streams: the Background Modeling (BM) stream and the Text Erasing (TE) stream. First, the BM stream takes the masked image as input and trains the model to recover the background regions. Meanwhile, the TE stream uses the recovery results from the BM stream to build pseudo labels and train the model for STR.
  • Figure 3: Comparison between MIM and our BM during training. MIM tends to generate text-like content, whereas our BM learns to erase text.
  • Figure 4: Qualitative results on SCUT-EnsText compared with EraseNet [liu2020erasenet], Tang et al. [tang2021stroke], and CTRNet [liu2022don].
  • Figure 5: Qualitative results compared with the inpainting method LaMa [suvorov2022resolution]. (c) is the input to the inpainting method, masked by the text detection ground truth. (d) is the result of LaMa [suvorov2022resolution].
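
The two-stream idea described in the abstract and in the Figure 2 caption can be illustrated with a small NumPy sketch. This is only one reading of that description, not the authors' implementation: the function names, the L1 loss, the patch-based random mask, and the choice to restrict the Background Modeling loss to masked pixels outside the text boxes are all assumptions made here for illustration. The sketch also assumes that the image height and width are divisible by the patch size.

```python
import numpy as np


def boxes_to_mask(boxes, height, width):
    """Rasterize axis-aligned text boxes (x0, y0, x1, y1) into a binary H x W mask."""
    mask = np.zeros((height, width), dtype=np.float32)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 1.0
    return mask


def random_patch_mask(height, width, patch, ratio, rng):
    """Mask roughly `ratio` of the patch x patch cells, in the style of masked image modeling.

    Assumes height and width are multiples of `patch`.
    """
    grid = (rng.random((height // patch, width // patch)) < ratio).astype(np.float32)
    return np.kron(grid, np.ones((patch, patch), dtype=np.float32))


def background_modeling_loss(pred, image, text_mask, input_mask):
    """L1 loss over masked pixels that lie outside the text boxes (assumed form).

    Text pixels are excluded so the model is never asked to reproduce text.
    """
    weight = input_mask * (1.0 - text_mask)
    total = (np.abs(pred - image) * weight[..., None]).sum()
    return float(total / max(weight.sum() * image.shape[-1], 1.0))


def pseudo_str_label(bm_pred, image, text_mask):
    """Build a pseudo STR target: stream output inside text boxes, original pixels elsewhere."""
    m = text_mask[..., None]
    return m * bm_pred + (1.0 - m) * image


def text_erasing_loss(te_pred, pseudo_label):
    """L1 loss between the Text Erasing output and the pseudo label (assumed form)."""
    return float(np.abs(te_pred - pseudo_label).mean())
```

In this sketch, the text detection boxes are the only supervision: they decide where the reconstruction loss is switched off and where the pseudo label takes its pixels from the Background Modeling output.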