DeepEraser: Deep Iterative Context Mining for Generic Text Eraser

Hao Feng, Wendi Wang, Shaokai Liu, Jiajun Deng, Wengang Zhou, Houqiang Li

TL;DR

DeepEraser tackles generic text removal by introducing a fixed-resolution, recurrent architecture that progressively erases targeted text through iterative context mining. It relies on a lightweight shared erasing module and a simple L1 training objective, enhanced by a custom mask generation strategy that supports adaptive removal. The approach achieves state-of-the-art results on SCUT-EnsText, SCUT-Syn, and Oxford Synthetic with about 1.4M parameters and demonstrates strong generalization to user-defined masks. This work provides a practical, efficient solution for privacy-preserving text removal and scene-text editing with robust qualitative and quantitative gains.

Abstract

In this work, we present DeepEraser, an effective deep network for generic text removal. DeepEraser utilizes a recurrent architecture that erases the text in an image via iterative operations. Our idea comes from the process of erasing pencil script, where the text area designated for removal is continuously monitored and the text is attenuated progressively, ensuring a thorough and clean erasure. Technically, at each iteration, an erasing module is deployed, which not only explicitly aggregates the previous erasing progress but also mines additional semantic context to erase the target text. Through iterative refinements, the text regions are progressively replaced with more appropriate content and finally converge to a relatively accurate state. Furthermore, a custom mask generation strategy is introduced to improve the capability of DeepEraser for adaptive text removal, as opposed to indiscriminately removing all the text in an image. DeepEraser is notably compact, with only 1.4M parameters, and is trained in an end-to-end manner. To verify its effectiveness, extensive experiments are conducted on several prevalent benchmarks, including SCUT-Syn, SCUT-EnsText, and the Oxford Synthetic text dataset. The quantitative and qualitative results demonstrate the advantage of DeepEraser over state-of-the-art methods, as well as its strong generalization ability in custom-mask text removal. The code and pre-trained models are available at https://github.com/fh2019ustc/DeepEraser
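The iterative erasing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the learned erasing module is replaced by a hypothetical hand-written function, `toy_erasing_module`, which simply moves masked pixels halfway toward a background estimate, and the additive update (current estimate plus a residual) is an assumption about how the per-iteration output is applied. All names and the grayscale toy image are made up for illustration.

```python
import numpy as np


def toy_erasing_module(current, mask, context, step=0.5):
    """Hypothetical stand-in for the learned, shared erasing module.

    Returns a residual that moves masked pixels part of the way toward
    `context`; unmasked pixels receive a zero residual.
    """
    return step * mask * (context - current)


def iterative_erase(image, mask, num_iters=8):
    """Apply the same module num_iters times, accumulating residuals.

    mask is assumed to be 1 where text should be erased and 0 elsewhere.
    """
    keep = 1.0 - mask
    # Toy "context": mean intensity of the pixels outside the mask.
    context = (image * keep).sum() / max(keep.sum(), 1.0)
    current = image.astype(np.float64).copy()
    history = []
    for _ in range(num_iters):
        residual = toy_erasing_module(current, mask, context)
        current = current + residual  # assumed additive residual update
        history.append(current.copy())
    return history


# Toy example: flat background of 0.2 with two bright "text" pixels.
image = np.full((4, 4), 0.2)
mask = np.zeros((4, 4))
mask[1, 1:3] = 1.0
image[1, 1:3] = 1.0

history = iterative_erase(image, mask, num_iters=8)
errors = [float(np.abs(h - 0.2).max()) for h in history]
```

In this toy setting the largest deviation from the background shrinks at every iteration, which mirrors the progressive attenuation the abstract describes; in the actual model the residual would come from a trained network rather than a fixed rule.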

Paper Structure (16 sections, 3 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Quantitative metrics for different methods on the SCUT-EnsText benchmark [zdenek2020erasing]. "*" denotes that the predicted text-free images preserve the non-text regions of the inputs, while the other methods use the model outputs directly. DeepEraser achieves the best performance with the fewest model parameters.
  • Figure 2: An overview of DeepEraser for text removal. Given a text image $\bm{I}_0$ and a binary mask $\bm{M}_0$ indicating the regions for text removal, we first extract features with a CNN-based backbone. Then, a shared erasing module refines the estimated text-free image across $K$ iterations. At the $k^{th}$ iteration, it explicitly aggregates the previous erasing progress and outputs the residual image $\bm{r}_k$ to update the erasing result $\bm{I}_k$. After $K$ iterations, we obtain the final predicted text-free image $\bm{I}_K$.
  • Figure 3: Architecture of backbone for feature extraction.
  • Figure 4: An illustration of the $k^{th}$ iteration in the erasing module. It takes (1) context feature $\bm{E}_I$, (2) previously estimated text-free image $\bm{I}_{k-1}$, and (3) latent feature $\bm{l}_{k-1}$ as input, and outputs the updated latent $\bm{l}_{k}$ and current residual image $\bm{r}_k$.
  • Figure 5: Qualitative results of each iteration in the inference stage on SCUT-EnsText [zdenek2020erasing]. For each of the two examples, the first row presents the input text image $\bm{I}_0$, the input text-erased mask $\bm{M}_0$, the text-preserved mask $\bm{M}_p$, and the ground truth $\bm{I}_{gt}$, respectively. The second and third rows are the predicted text-free images $\bm{I}_k$, $k \in \{1, 2, \dots, 8\}$. As the number of iterations increases, the text is erased progressively.
  • ...and 4 more figures
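
The "*" convention in the Figure 1 caption, where the output preserves the non-text regions of the input, can be sketched as a simple mask-based composite. This is an illustrative assumption rather than the authors' code: the function name `preserve_non_text` is hypothetical, and the mask is assumed to be 1 inside the regions to erase and 0 elsewhere.

```python
import numpy as np


def preserve_non_text(input_image, prediction, erase_mask):
    """Keep predicted pixels inside the mask and input pixels outside it.

    Assumes erase_mask is 1 where text is removed and 0 elsewhere
    (the polarity is an assumption made for this sketch).
    """
    mask = erase_mask.astype(np.float64)
    if mask.ndim == input_image.ndim - 1:
        mask = mask[..., None]  # broadcast an (H, W) mask over channels
    return mask * prediction + (1.0 - mask) * input_image


# Toy example with random arrays standing in for real images.
rng = np.random.default_rng(0)
input_image = rng.random((4, 4, 3))
prediction = rng.random((4, 4, 3))
erase_mask = np.zeros((4, 4))
erase_mask[1:3, 1:3] = 1.0

composited = preserve_non_text(input_image, prediction, erase_mask)
```

With this composite, any change the model makes outside the mask is discarded, so the comparison between starred and unstarred entries reflects both the erasing quality and this post-processing choice.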