TextSleuth: Towards Explainable Tampered Text Detection

Chenfan Qu, Jian Liu, Haoxing Chen, Baihan Yu, Jingjing Liu, Weiqiang Wang, Lianwen Jin

TL;DR

This work tackles the problem of explainable tampered text detection by introducing the Explainable Tampered Text Detection (ETTD) dataset and the TextSleuth model. It combines pixel-level tampered-region annotations with natural-language anomaly descriptions generated via GPT-4o, guided by a fused mask prompt and OCR-based filtering to ensure description quality. TextSleuth employs a two-stage detection-grounding approach with an auxiliary grounding prompt to focus reasoning on the suspected region, improving both fine-grained perception and cross-domain generalization. The results demonstrate strong in-domain and cross-domain performance on ETTD and public datasets, establishing a new benchmark and offering practical insights for interpretable forensics in text-rich imagery.

Abstract

Recently, tampered text detection has attracted increasing attention due to its essential role in information security. Although existing methods can detect the tampered text region, the basis for such detections remains unclear, making the predictions hard to trust. To address this problem, we propose to explain the basis of tampered text detection in natural language via large multimodal models. To fill the data gap for this task, we propose a large-scale, comprehensive dataset, ETTD, which contains both pixel-level annotations for the tampered text region and natural-language annotations describing the anomaly of the tampered text. Multiple methods are employed to improve the quality of the proposed data. For example, elaborate queries are introduced to generate high-quality anomaly descriptions with GPT-4o. A fused mask prompt is proposed to reduce confusion when querying GPT-4o to generate anomaly descriptions. To automatically filter out low-quality annotations, we also prompt GPT-4o to recognize the tampered text before describing the anomaly, and discard responses with low OCR accuracy. To further improve explainable tampered text detection, we propose a simple yet effective model called TextSleuth, which achieves improved fine-grained perception and cross-domain generalization by focusing on the suspected region, with a two-stage analysis paradigm and an auxiliary grounding prompt. Extensive experiments on both the ETTD dataset and the public dataset have verified the effectiveness of the proposed methods. In-depth analysis is also provided to inspire further research. Our dataset and code will be open-sourced.
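
The OCR-based filtering step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the similarity measure (Python's `difflib.SequenceMatcher`), the 0.8 threshold, and the record fields `recognized_text` and `ground_truth_text` are all assumptions chosen for illustration. The idea shown is only that a response is kept when the text the model claims to have read closely matches the known tampered text.

```python
from difflib import SequenceMatcher


def ocr_accuracy(recognized: str, ground_truth: str) -> float:
    """Character-level similarity in [0, 1] between two strings.

    Assumption: case and repeated whitespace are ignored before comparing.
    """
    a = " ".join(recognized.lower().split())
    b = " ".join(ground_truth.lower().split())
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def filter_annotations(records, threshold=0.8):
    """Keep records whose recognized text matches the ground truth well enough.

    `records` is a list of dicts with hypothetical keys
    'recognized_text' and 'ground_truth_text'; `threshold` is an
    illustrative value, not one taken from the paper.
    """
    kept = []
    for rec in records:
        score = ocr_accuracy(rec["recognized_text"], rec["ground_truth_text"])
        if score >= threshold:
            kept.append(rec)
    return kept


if __name__ == "__main__":
    samples = [
        {"recognized_text": "Total  1250", "ground_truth_text": "total 1250"},
        {"recognized_text": "abc", "ground_truth_text": "xyz"},
    ]
    print(len(filter_annotations(samples)))
```

A response that misreads the tampered text is likely describing the wrong region, so dropping it removes an unreliable description without any manual review.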

Paper Structure

This paper contains 18 sections, 9 figures, 7 tables.

Figures (9)

  • Figure 1: We propose to both detect the tampered text region and explain the basis for the detection in natural language, making the prediction more reliable. We construct the first dataset and propose a novel model for the explainable tampered text detection task.
  • Figure 2: The pipeline for obtaining the textual anomaly description for the tampered text.
  • Figure 3: The binary mask prompt as in existing work is confusing in text images. In contrast, our proposed fused mask prompt clearly indicates the content and the exact location of the tampered text.
  • Figure 4: The overall pipeline of the proposed TextSleuth.
  • Figure 5: Our proposed textual prompts are specially designed for tampered text and can guide GPT-4o to generate high-quality anomaly descriptions for tampered text.
  • ...and 4 more figures