TextSleuth: Towards Explainable Tampered Text Detection
Chenfan Qu, Jian Liu, Haoxing Chen, Baihan Yu, Jingjing Liu, Weiqiang Wang, Lianwen Jin
TL;DR
This work tackles explainable tampered text detection by introducing the Explainable Tampered Text Detection (ETTD) dataset and the TextSleuth model. ETTD combines pixel-level tampered-region annotations with natural-language anomaly descriptions generated by GPT-4o; a fused mask prompt and OCR-based filtering are used to improve description quality. TextSleuth uses a two-stage analysis paradigm with an auxiliary grounding prompt to focus reasoning on the suspected region, improving both fine-grained perception and cross-domain generalization. Experiments show strong in-domain and cross-domain performance on ETTD and public datasets, establishing a new benchmark and offering practical insights for interpretable forensics in text-rich imagery.
Abstract
Recently, tampered text detection has attracted increasing attention due to its essential role in information security. Although existing methods can detect tampered text regions, the basis for their predictions remains unclear, which makes the predictions unreliable. To address this problem, we propose to explain the basis of tampered text detection in natural language using large multimodal models. To fill the data gap for this task, we introduce a large-scale, comprehensive dataset, ETTD, which contains both pixel-level annotations of tampered text regions and natural-language annotations describing the anomalies of the tampered text. Multiple methods are employed to improve the quality of the data. First, elaborate queries are introduced to generate high-quality anomaly descriptions with GPT-4o. Second, a fused mask prompt is proposed to reduce confusion when querying GPT-4o for anomaly descriptions. Third, to automatically filter out low-quality annotations, we prompt GPT-4o to recognize the tampered text before describing the anomaly, and discard responses with low OCR accuracy. To further improve explainable tampered text detection, we propose a simple yet effective model called TextSleuth, which achieves improved fine-grained perception and cross-domain generalization by focusing on the suspected region, using a two-stage analysis paradigm and an auxiliary grounding prompt. Extensive experiments on both the ETTD dataset and the public dataset verify the effectiveness of the proposed methods. In-depth analysis is also provided to inspire further research. Our dataset and code will be open-sourced.
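
The OCR-based filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the similarity metric (`difflib.SequenceMatcher`), the 0.8 threshold, the record field names (`tampered_text`, `recognized_text`, `description`), and the sample records are all assumptions made for illustration. The idea is only that a response is kept when the text the model recognized closely matches the known tampered text.

```python
from difflib import SequenceMatcher


def ocr_accuracy(recognized: str, reference: str) -> float:
    """Character-level similarity in [0, 1], ignoring case and outer whitespace.

    Assumption: the abstract does not define "OCR accuracy"; a generic
    string-similarity ratio stands in for it here.
    """
    a = recognized.strip().lower()
    b = reference.strip().lower()
    if not a and not b:
        return 1.0  # two empty strings are treated as a perfect match
    return SequenceMatcher(None, a, b).ratio()


def filter_annotations(records, threshold=0.8):
    """Keep records whose recognized text is close enough to the tampered text.

    Assumption: the 0.8 threshold is illustrative, not a value from the paper.
    """
    kept = []
    for record in records:
        score = ocr_accuracy(record["recognized_text"], record["tampered_text"])
        if score >= threshold:
            kept.append(record)
    return kept


# Hypothetical records: the known tampered text, the text the model said it
# recognized, and the anomaly description it produced afterwards.
records = [
    {
        "tampered_text": "2024",
        "recognized_text": "2024",
        "description": "Placeholder description for a correctly read region.",
    },
    {
        "tampered_text": "2024",
        "recognized_text": "Total",
        "description": "Placeholder description for a misread region.",
    },
]

kept = filter_annotations(records)
print(len(kept))
```

In this sketch the second record is dropped because the model read the wrong text, which suggests its anomaly description may not refer to the tampered region at all.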
