AIForge-Doc: A Benchmark for Detecting AI-Forged Tampering in Financial and Form Documents

Jiaqi Wu, Yuchen Zhou, Muduo Xu, Zisheng Liang, Simiao Ren, Jiayu Xue, Meige Yang, Siying Chen, Jingheng Huan

TL;DR

AIForge-Doc is the first dedicated benchmark targeting exclusively diffusion-model-based inpainting in financial and form documents with pixel-level annotation; its results indicate that AI-forged values are indistinguishable to automated detectors and VLMs.

Abstract

We present AIForge-Doc, the first dedicated benchmark targeting exclusively diffusion-model-based inpainting in financial and form documents with pixel-level annotation. Existing document forgery datasets rely on traditional digital editing tools (e.g., Adobe Photoshop, GIMP), creating a critical gap: state-of-the-art detectors are blind to the rapidly growing threat of AI-forged document fraud. AIForge-Doc addresses this gap by systematically forging numeric fields in real-world receipt and form images using two AI inpainting APIs -- Gemini 2.5 Flash Image and Ideogram v2 Edit -- yielding 4,061 forged images from four public document datasets (CORD, WildReceipt, SROIE, XFUND) across nine languages, annotated with pixel-precise tampered-region masks in DocTamper-compatible format. We benchmark three representative detectors -- TruFor, DocTamper, and a zero-shot GPT-4o judge -- and find that all existing methods degrade substantially: TruFor achieves AUC=0.751 (zero-shot, out-of-distribution) vs. AUC=0.96 on NIST16; DocTamper achieves AUC=0.563 vs. AUC=0.98 in-distribution, with pixel-level IoU=0.020; GPT-4o achieves only 0.509 -- essentially at chance -- confirming that AI-forged values are indistinguishable to automated detectors and VLMs. These results demonstrate that AIForge-Doc represents a qualitatively new and unsolved challenge for document forensics.
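
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how pixel-level IoU and image-level AUC are commonly computed. It is not the authors' evaluation code: the function names `pixel_iou` and `image_auc` are hypothetical, the empty-mask convention is an assumption, and the paper may use different thresholds or library routines.

```python
import numpy as np

def pixel_iou(pred_mask, gt_mask):
    """Intersection-over-union between two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)

def image_auc(scores, labels):
    """AUC as the probability that a forged image outscores an authentic one.

    Ties contribute one half. labels: 1/True = forged, 0/False = authentic.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one forged and one authentic image")
    diff = pos[:, None] - neg[None, :]  # all forged-vs-authentic score pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))
```

Under this definition, a detector that assigns the same score to every image gets an AUC of 0.5, which is the "chance" level the abstract refers to.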

Paper Structure (64 sections, 9 figures, 3 tables)


Figures (9)

  • Figure 1: AIForge-Doc: AI-inpainted document forgeries pass visual inspection and are difficult to distinguish from authentic documents. Each row: authentic document with target field highlighted (left), AI-forged version (center), pixel-precise ground-truth mask (right). From top: CORD receipt (Ideogram v2 Edit), WildReceipt (Gemini 2.5 Flash Image), SROIE receipt (Gemini 2.5 Flash Image), XFUND multilingual form (Ideogram v2 Edit). The tampered region (a single numeric field) comprises a median of 0.9% of image pixels, yet contains the forensically critical edit.
  • Figure 2: AIForge-Doc generation pipeline. Top: the overall dataset creation workflow, from source datasets through field selection and tool assignment (informed by a 320-trial prompt ablation study) to the final DocTamper-compatible dataset. Bottom: the per-image context-window inpainting technique. Starting from a source document (a), we expand a context crop (b), create a binary inpainting mask (c), feed the crop and mask to the AI API (d), and paste only the field region back into the full image (e) to produce the forged output and pixel-precise ground-truth mask (f).
  • Figure 3: Prompt ablation: 5 representative outputs per rejected tool on WildReceipt image 000002699 (Prod_item_key, "HULAHAWAIIANT1" $\to$ "HULAHAWAIIANT4"). Each row shows the Gemini 2.5 Flash reference (green border) alongside outputs from four prompt strategies spanning minimal, OCR-focused, step-by-step, and color-aware prompts. FLUX Fill Pro renders plausible digit shapes but consistently wrong values; GPT-Image-1 produces a black N/A patch or blurry wrong-font text at 512 px; SD 3.5 Medium renders prompt text literally or produces garbled symbols; SD 1.5 Inpainting yields illegible blurred patches. No prompt strategy succeeds for any rejected tool. Full 20-prompt-variant grids for FLUX Fill Pro, GPT-Image-1, SD 3.5 Medium, and SD 1.5 Inpainting are given in the ablation appendix.
  • Figure 4: Distribution of tampered pixel fraction across 4,061 AIForge-Doc images. Median: 0.92% (vertical blue line); IQR: [0.35%, 1.55%]. Over 99% of pixels in each image are unmodified; the tampered region is a small, localized field bounding box.
  • Figure 5: NoisePrint++ heatmaps for two AIForge-Doc examples. Each row shows: original document crop (left), AI-forged version (center-left), ground-truth mask (center-right), and TruFor NoisePrint++ heatmap (right; hot colormap, 0 = authentic, 1 = forged). Top row (Gemini 2.5 Flash Image, TruFor score = 1.000): TruFor assigns high confidence to this Gemini forgery; the heatmap shows elevated response near the tampered bounding box (cyan rectangle), suggesting residual noise-level artifacts from Gemini's generation process. Bottom row (Ideogram v2 Edit, TruFor score = 0.505): TruFor is near-random on this Ideogram forgery; the heatmap is diffuse and uniform, indicating that Ideogram's inpainting leaves no detectable sensor-level signature, consistent with its per-tool AUC of 0.521 (CI spans 0.50; see the per-tool results table). The per-tool gap ($\Delta$AUC = 0.257) shows that different AI generators produce forgeries of qualitatively different detectability.
  • ...and 4 more figures
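
The context-window inpainting steps in the Figure 2 caption, and the tampered pixel fraction plotted in Figure 4, can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: `expand_box`, `forge_field`, `tampered_fraction`, the `margin` parameter, and the `inpaint_fn` callback (a stand-in for a real inpainting API call) are all hypothetical, and the sketch assumes the inpainting step returns an array with the same shape as the crop it was given.

```python
import numpy as np

def expand_box(box, margin, width, height):
    """Grow a (x0, y0, x1, y1) box by `margin` pixels, clipped to the image."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(width, x1 + margin), min(height, y1 + margin))

def forge_field(image, field_box, inpaint_fn, margin=32):
    """Inpaint one field inside a context crop, then paste only the field back.

    inpaint_fn(crop, crop_mask) is a hypothetical stand-in for an inpainting
    API; it is assumed to return an array shaped like `crop`.
    """
    h, w = image.shape[:2]
    x0, y0, x1, y1 = field_box
    cx0, cy0, cx1, cy1 = expand_box(field_box, margin, w, h)

    # (b) context crop around the field, (c) binary mask of the field in crop coordinates
    crop = image[cy0:cy1, cx0:cx1].copy()
    crop_mask = np.zeros(crop.shape[:2], dtype=np.uint8)
    crop_mask[y0 - cy0:y1 - cy0, x0 - cx0:x1 - cx0] = 255

    # (d) run the inpainting step on the crop
    edited = inpaint_fn(crop, crop_mask)

    # (e) paste only the field region back, leaving every other pixel untouched
    forged = image.copy()
    forged[y0:y1, x0:x1] = edited[y0 - cy0:y1 - cy0, x0 - cx0:x1 - cx0]

    # (f) full-resolution ground-truth mask of the tampered region
    gt_mask = np.zeros((h, w), dtype=np.uint8)
    gt_mask[y0:y1, x0:x1] = 255
    return forged, gt_mask

def tampered_fraction(gt_mask):
    """Fraction of image pixels marked as tampered in the mask."""
    return float((np.asarray(gt_mask) > 0).mean())
```

Pasting back only the field region, rather than the whole edited crop, is what keeps the ground-truth mask pixel-precise: any change the inpainting step makes to the surrounding context is discarded.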