OTR: Synthesizing Overlay Text Dataset for Text Removal

Jan Zdenek, Wataru Shimoda, Kota Yamaguchi

TL;DR

This work addresses the limitations of scene-text benchmarks for text removal, notably ground-truth artifacts and insufficient background complexity. It introduces OTR, a synthetic Overlay Text Removal dataset with artifact-free ground truth, achieved by overlaying text on clean images in an object-aware manner and using vision-language model-generated content. OTR comprises two test subsets, OTR-easy and OTR-hard, and a substantial training set, all with rich ground-truth annotations. The study demonstrates that no-reference image quality assessment (NR-IQA) metrics reflect perceptual quality better than traditional pixel-based metrics, and shows that OTR-hard is a harder, more discriminative benchmark because its backgrounds have higher entropy. Overall, the dataset and evaluation approach offer a practical path toward robust, cross-domain text removal research with perceptual quality as a core criterion.

Abstract

Text removal is a crucial task in computer vision with applications such as privacy preservation, image editing, and media reuse. While existing research has primarily focused on scene text removal in natural images, limitations in current datasets hinder both out-of-domain generalization and accurate evaluation. In particular, widely used benchmarks such as SCUT-EnsText suffer from ground truth artifacts due to manual editing, overly simplistic text backgrounds, and evaluation metrics that do not capture the quality of generated results. To address these issues, we introduce an approach to synthesizing a text removal benchmark applicable to domains other than scene text. Our dataset features text rendered on complex backgrounds using object-aware placement and vision-language model-generated content, ensuring clean ground truth and challenging text removal scenarios. The dataset is available at https://huggingface.co/datasets/cyberagent/OTR .

Paper Structure

This paper contains 18 sections, 4 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Examples from the SCUT-EnsText dataset. From left to right: the original image (first), the ground truth image (second), a map of pixels whose difference in value between the original and ground truth image exceeds the set threshold (third), and the absolute difference between the original image and the ground truth (fourth). The difference between the original and ground truth image should be as small as possible, but the regions surrounding the text contain altered pixels. Ideally, there should be no difference between the two once text stroke regions are excluded.
  • Figure 2: Results of text removal by EraseNet (middle) and FLUX.1 Fill (right). While the PSNR, SSIM and AGE metrics suggest that the results of EraseNet are better, the results of FLUX.1 Fill look more convincing to a human viewer.
  • Figure 3: A diagram of our data generation process. We use a scene text detection model to filter out images that already contain text. Images with no detected text are passed to a VLM that generates short descriptive phrases, which are then rendered as overlay text on the images.
  • Figure 4: Text removal results produced by PERT, ViT-Eraser and FLUX.1 Fill, with a comparison of the discrepancies between the scores of different metrics.
  • Figure 5: Correlation between the information entropy and several metric scores.
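
The difference maps described for Figure 1 and the information entropy referred to in Figure 5 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function names `difference_maps` and `shannon_entropy` are hypothetical, the default threshold of 10 is a placeholder (the source does not state the threshold used), and entropy is taken here as the Shannon entropy of the 8-bit grayscale histogram, which may differ from the paper's exact definition.

```python
import numpy as np


def difference_maps(original, ground_truth, threshold=10):
    """Compare an image with its text-removed ground truth.

    Both inputs are uint8 arrays of identical shape, (H, W) or (H, W, C).
    Returns the absolute difference image and a boolean map of pixels whose
    difference exceeds `threshold` (assumed value, not from the paper).
    """
    # Widen to a signed type so the subtraction cannot wrap around.
    a = original.astype(np.int16)
    b = ground_truth.astype(np.int16)
    abs_diff = np.abs(a - b).astype(np.uint8)

    # For colour images, flag a pixel if any channel exceeds the threshold.
    per_pixel = abs_diff.max(axis=-1) if abs_diff.ndim == 3 else abs_diff
    mask = per_pixel > threshold
    return abs_diff, mask


def shannon_entropy(gray):
    """Shannon entropy (in bits) of an 8-bit grayscale image histogram."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # skip empty bins; 0 * log(0) is treated as 0
    return float(-(p * np.log2(p)).sum())
```

In an artifact-free ground truth, the thresholded map would be empty everywhere outside the text stroke regions. A flat background yields an entropy of 0 bits, while a background using all 256 gray levels equally often yields the maximum of 8 bits.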