OTR: Synthesizing Overlay Text Dataset for Text Removal
Jan Zdenek, Wataru Shimoda, Kota Yamaguchi
TL;DR
This work addresses the limitations of scene-text benchmarks for text removal, notably ground-truth artifacts and insufficient background complexity. It introduces OTR, a synthetic Overlay Text Removal dataset with artifact-free ground truth, achieved by overlaying text on clean images in an object-aware manner and using vision-language model-generated content. OTR comprises two test subsets, OTR-easy and OTR-hard, and a substantial training set, all with rich ground-truth annotations. The study demonstrates that NR-IQA metrics better reflect perceptual quality than traditional pixel-based metrics and shows that OTR-hard provides a harder, more discriminative benchmark due to higher background entropy. Overall, the dataset and evaluation approach offer a practical path toward robust, cross-domain text removal research using perceptual quality as a core criterion.
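The notion of background entropy mentioned above can be illustrated with a minimal sketch. It assumes entropy means the Shannon entropy of the grayscale intensity histogram inside a text region; the summary does not give the exact definition used for OTR, so the function names `shannon_entropy` and `region_entropy` and the box convention are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy (in bits) of the intensity histogram of a flat pixel list."""
    total = len(pixels)
    if total == 0:
        raise ValueError("pixels must be non-empty")
    counts = Counter(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def region_entropy(image, box):
    """Entropy of a rectangular region of a grayscale image.

    `image` is a list of rows of integer intensities; `box` is
    (left, top, right, bottom) with exclusive right/bottom edges.
    This box convention is an assumption made for illustration.
    """
    left, top, right, bottom = box
    region = [image[y][x] for y in range(top, bottom) for x in range(left, right)]
    return shannon_entropy(region)


# A flat background carries no information; a two-tone pattern carries one bit.
flat = [[128] * 4 for _ in range(4)]
checker = [[0, 255, 0, 255] for _ in range(4)]
flat_bits = region_entropy(flat, (0, 0, 4, 4))
checker_bits = region_entropy(checker, (0, 0, 4, 4))
```

Under this assumed definition, a text region over a uniform background scores near zero, while a region over textured content scores higher, which matches the intuition that high-entropy backgrounds are harder to restore.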
Abstract
Text removal is a crucial task in computer vision with applications such as privacy preservation, image editing, and media reuse. While existing research has primarily focused on scene text removal in natural images, limitations in current datasets hinder both out-of-domain generalization and accurate evaluation. In particular, widely used benchmarks such as SCUT-EnsText suffer from ground-truth artifacts caused by manual editing, overly simplistic text backgrounds, and evaluation metrics that do not capture the quality of generated results. To address these issues, we introduce an approach to synthesizing a text removal benchmark applicable to domains other than scene text. Our dataset features text rendered on complex backgrounds using object-aware placement and vision-language model-generated content, ensuring clean ground truth and challenging text removal scenarios. The dataset is available at https://huggingface.co/datasets/cyberagent/OTR .
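The reason synthesis yields clean ground truth can be shown with a toy sketch: when text is composited onto an untouched image, the original image itself is the exact removal target, so no manual inpainting is involved. The code below is an illustration under simplifying assumptions (grayscale images as nested lists, a precomputed binary glyph mask, a single text color); `overlay_text` is a hypothetical helper, not the authors' pipeline, and it omits object-aware placement and VLM-generated content.

```python
def overlay_text(clean, mask, color):
    """Composite a text mask onto a clean image.

    `clean` is a list of rows of grayscale intensities, `mask` has the same
    shape with truthy values where glyph pixels fall, and `color` is the
    text intensity. Returns (composited, ground_truth).
    """
    if len(clean) != len(mask) or any(len(r) != len(m) for r, m in zip(clean, mask)):
        raise ValueError("clean image and mask must have the same shape")
    composited = [
        [color if m else p for p, m in zip(row, mask_row)]
        for row, mask_row in zip(clean, mask)
    ]
    # The ground truth is a copy of the untouched image, so it is exact
    # everywhere, including under the text.
    ground_truth = [row[:] for row in clean]
    return composited, ground_truth


clean = [[10, 20], [30, 40]]
mask = [[0, 1], [0, 0]]
composited, ground_truth = overlay_text(clean, mask, color=255)
```

Because the target is the source image rather than an edited photograph, any difference between a model's output and the ground truth is attributable to the model, not to annotation artifacts.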
