DOCFORGE-BENCH: A Comprehensive Benchmark for Document Forgery Detection and Analysis

Zengqi Zhao, Weidi Xia, Peter Wei, Yan Zhang, Yiyi Zhang, Jane Mo, Tiannan Zhang, Yuanqin Dai, Zexi Chen, Simiao Ren

TL;DR

DOCFORGE-BENCH is the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods across eight datasets spanning text tampering, receipt forgery, and identity document manipulation. Its central finding is a pervasive calibration failure that is invisible under single-threshold protocols.

Abstract

We present DOCFORGE-BENCH, the first unified zero-shot benchmark for document forgery detection, evaluating 14 methods across eight datasets spanning text tampering, receipt forgery, and identity document manipulation. Unlike fine-tuning-oriented evaluations such as ForensicHub [Du et al., 2025], DOCFORGE-BENCH applies all methods with their published pretrained weights and no domain adaptation -- a deliberate design choice that reflects the realistic deployment scenario where practitioners lack labeled document training data. Our central finding is a pervasive calibration failure invisible under single-threshold protocols: methods achieve moderate Pixel-AUC (>=0.76) yet near-zero Pixel-F1. This AUC-F1 gap is not a discrimination failure but a score-distribution shift: tampered regions occupy only 0.27-4.17% of pixels in document images -- an order of magnitude less than in natural image benchmarks -- making the standard tau=0.5 threshold catastrophically miscalibrated. Oracle-F1 is 2-10x higher than fixed-threshold Pixel-F1, confirming that calibration, not representation, is the bottleneck. A controlled calibration experiment validates this: adapting a single threshold on N=10 domain images recovers 39-55% of the Oracle-F1 gap, demonstrating that threshold adaptation -- not retraining -- is the key missing step for practical deployment. Overall, no evaluated method works reliably out-of-the-box on diverse document types, underscoring that document forgery detection remains an unsolved problem. We further note that all eight datasets predate the era of generative AI editing; benchmarks covering diffusion- and LLM-based document forgeries represent a critical open gap on the modern attack surface.
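The calibration argument in the abstract can be illustrated with a small synthetic sketch. It is not the paper's evaluation code: the score distributions, the 2% tampered fraction, the threshold grid, and the selection rule (pick the threshold that maximizes mean Pixel-F1 on N=10 calibration images) are all assumptions made for illustration, and the helper names (`pixel_f1`, `calibrate_threshold`, `oracle_f1`, `synthetic_image`) are hypothetical. The scores are built so that tampered pixels rank above authentic ones while almost all of them stay below 0.5, which mimics a high-AUC, near-zero-F1 detector.

```python
import numpy as np

def pixel_f1(scores, mask, tau):
    """Pixel-level F1 of the binarized map (scores >= tau) against a boolean mask."""
    pred = scores >= tau
    tp = np.sum(pred & mask)
    fp = np.sum(pred & ~mask)
    fn = np.sum(~pred & mask)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

def mean_f1(images, tau):
    """Mean per-image Pixel-F1 at one shared threshold."""
    return float(np.mean([pixel_f1(s, m, tau) for s, m in images]))

def calibrate_threshold(images, grid):
    """Assumed selection rule: the grid threshold with the highest mean F1."""
    return float(grid[int(np.argmax([mean_f1(images, t) for t in grid]))])

def oracle_f1(images, grid):
    """Best F1 over the grid, chosen separately for each image, then averaged."""
    return float(np.mean([max(pixel_f1(s, m, t) for t in grid) for s, m in images]))

def synthetic_image(rng, n_pixels=10_000, n_tampered=200):
    """Hypothetical detector output: 2% tampered pixels, scores well below 0.5."""
    mask = np.zeros(n_pixels, dtype=bool)
    mask[:n_tampered] = True
    scores = np.where(
        mask,
        rng.normal(0.30, 0.05, n_pixels),  # tampered pixels: higher, but < 0.5
        rng.normal(0.10, 0.05, n_pixels),  # authentic pixels
    )
    return np.clip(scores, 0.0, 1.0), mask

rng = np.random.default_rng(0)
calib = [synthetic_image(rng) for _ in range(10)]  # N=10 calibration images
test = [synthetic_image(rng) for _ in range(50)]   # held-out images
grid = np.linspace(0.01, 0.99, 99)

tau_cal = calibrate_threshold(calib, grid)
f1_fixed = mean_f1(test, 0.5)       # fixed tau = 0.5
f1_cal = mean_f1(test, tau_cal)     # single adapted threshold
f1_oracle = oracle_f1(test, grid)   # per-image best threshold
gap_recovered = (f1_cal - f1_fixed) / (f1_oracle - f1_fixed)

print(f"tau_cal={tau_cal:.2f}  F1@0.5={f1_fixed:.3f}  "
      f"F1@tau_cal={f1_cal:.3f}  Oracle-F1={f1_oracle:.3f}  "
      f"gap recovered={gap_recovered:.0%}")
```

Because the synthetic scores are identically distributed across images, one adapted threshold recovers far more of the gap here than the 39-55% reported in the abstract; real detectors shift their score distribution from image to image, which is why a single threshold only partially closes the gap.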

Paper Structure (73 sections, 6 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Pixel-F1 (left) and Pixel-AUC (right) for all 14 evaluated methods---document-specific (above separator) and general forensic (below)---across eight document datasets. Pixel-AUC is consistently moderate to high while Pixel-F1 at fixed $\tau{=}0.5$ remains near zero for most (method, dataset) pairs. The pervasive AUC--F1 gap confirms calibration failure---not discriminative failure---as the dominant bottleneck: methods correctly rank tampered pixels above authentic ones but cannot identify a usable decision threshold in the document domain. ForensicHub methods (FFDN, CAFTB, TIFDM) achieve AUC $>$ 0.90 on multiple datasets where Pixel-F1 is below 0.05, confirming the calibration gap is not resolved by document-specific training. Appendix Fig. \ref{fig:secondary_metrics} shows Pixel-IoU and Oracle F1.
  • Figure 2: Mean cross-domain Pixel-F1 (left) and Pixel-AUC (right) for all 14 methods across the five cross-domain datasets (RealTextManipulation, Tampered-IC13, ReceiptForgery, MixTamper, FSTS-1.5k). Error bars show standard deviation. The dashed vertical line marks the mean across all general methods. Despite document-specific training, CAFTB-Net is the only doc-specific method that clearly outperforms both TruFor and CAT-Net on F1; on AUC, the two method families overlap substantially, confirming that calibration---not feature discrimination---distinguishes the groups.
  • Figure 3: Calibration failure across all 14 methods: Pixel-F1 @ $\tau{=}0.5$ (dark blue), Oracle F1 at best threshold (light blue), and Pixel-AUC (green), all averaged across the eight document datasets. The consistent ordering AUC $\gg$ Oracle F1 $>$ Pixel-F1 holds for every method, confirming that score-distribution shift---not feature discrimination---is the primary bottleneck. A dashed vertical line separates document-specific (left) from general methods (right).
  • Figure 4: Pixel-IoU (left) and Oracle F1 (right) for all 14 evaluated methods across eight document datasets. Pixel-IoU tracks Pixel-F1 closely ($\mathrm{IoU} = \mathrm{F1}/(2-\mathrm{F1})$) and is included for comparison with prior work. Oracle F1 is the best achievable F1 at any threshold per image; the large gap between Oracle F1 and the fixed-threshold Pixel-F1 in Fig. \ref{fig:all_metrics} quantifies calibration error across the document domain.
  • Figure 5: Distribution of Pixel-F1 across eight datasets per method, shown as horizontal box plots sorted by median (descending). Individual dataset scores are overlaid as jittered points. Red methods are document-specific; blue are general forensics. The wide interquartile ranges confirm that no method generalises uniformly: a method can have the highest median while still scoring near zero on at least two datasets.
  • ...and 3 more figures
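
The identity $\mathrm{IoU} = \mathrm{F1}/(2-\mathrm{F1})$ quoted in the Figure 4 caption follows from the standard confusion-count definitions. A short derivation is given below, assuming both metrics are computed from the same true-positive (TP), false-positive (FP), and false-negative (FN) pixel counts; the symbols TP, FP, and FN are introduced here for illustration and do not appear in the captions.

```latex
\begin{align*}
\mathrm{F1}  &= \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},
\qquad
\mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \\[4pt]
2 - \mathrm{F1}
  &= \frac{2\,(2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}) - 2\,\mathrm{TP}}
          {2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}
   = \frac{2\,(\mathrm{TP} + \mathrm{FP} + \mathrm{FN})}
          {2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \\[4pt]
\frac{\mathrm{F1}}{2 - \mathrm{F1}}
  &= \frac{2\,\mathrm{TP}}{2\,(\mathrm{TP} + \mathrm{FP} + \mathrm{FN})}
   = \mathrm{IoU}.
\end{align*}
```

The identity is exact for a single set of counts (one image, or counts pooled over a dataset). Since the map from F1 to IoU is nonlinear, it holds only approximately for scores that are averaged over images or datasets.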