
DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning

Fanwei Zeng, Changtao Miao, Jing Huang, Zhiya Tan, Shutao Gong, Xiaoming Yu, Yang Wang, Weibin Yao, Joey Tianyi Zhou, Jianshu Li, Yin Yan

Abstract

The rapid progress of generative AI has enabled increasingly realistic text-centric image forgeries, posing major challenges to document safety. Existing forensic methods mainly rely on visual cues and lack evidence-based reasoning to reveal subtle text manipulations. Detection, localization, and explanation are often treated as isolated tasks, limiting reliability and interpretability. To tackle these challenges, we propose DocShield, the first unified framework formulating text-centric forgery analysis as a visual-logical co-reasoning problem. At its core, a novel Cross-Cues-aware Chain of Thought (CCT) mechanism enables implicit agentic reasoning, iteratively cross-validating visual anomalies with textual semantics to produce consistent, evidence-grounded forensic analysis. We further introduce a Weighted Multi-Task Reward for GRPO-based optimization, aligning reasoning structure, spatial evidence, and authenticity prediction. Complementing the framework, we construct RealText-V1, a multilingual dataset of document-like text images with pixel-level manipulation masks and expert-level textual explanations. Extensive experiments show DocShield significantly outperforms existing methods, improving macro-average F1 by 41.4% over specialized frameworks and 23.4% over GPT-4o on T-IC13, with consistent gains on the challenging T-SROIE benchmark. Our dataset, model, and code will be publicly released.

Paper Structure

This paper contains 16 sections, 2 equations, 4 figures, 5 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Performance comparison on the RealText-V1 benchmark. DocShield (pink area) achieves state-of-the-art results across Detection (D), Grounding (G), and Explanation (E) tasks. M-F1 (macro-average F1 over the three tasks) serves as the unified evaluation metric, demonstrating DocShield's superior and balanced multi-task capability.
  • Figure 2: Overview of the DocShield framework. Given an input image and a prompt, the model autoregressively generates a structured, machine-readable Report. The analytical core is the Cross-Cues-aware Chain of Thought (CCT), a six-stage mechanism that iteratively extracts and cross-validates visual and logical anomalies. To ensure forensic faithfulness, the framework is optimized via Group Relative Policy Optimization (GRPO) using a Weighted Multi-Task Reward, strictly aligning spatial evidence, format compliance, and explanation fidelity.
  • Figure 3: The architecture of our PR² (Perceiver, Reasoner, Reviewer) pipeline. After an initial data collection stage, our multi-agent system generates annotations through a collaborative, iterative process. The Perceiver drafts an analysis, the Reasoner structures it into the target CCT and analysis report, and the Reviewer validates its quality, initiating a refinement loop if necessary. This cycle, indicated by the solid forward and dashed feedback arrows, ensures the final output is a high-fidelity, structured JSON annotation.
  • Figure 4: Qualitative comparison of artifact grounding and explanations across different methods. DocShield demonstrates superior performance, accurately identifying both visual artifacts and logical cues. Shaded regions indicate the localized tampered areas.