GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics

Yan Zhang; Simiao Ren; Ankit Raj; En Wei; Dennis Ng; Alex Shen; Jiayue Xu; Yuxin Zhang; Evelyn Marotta

TL;DR

GPT4o-Receipt is a benchmark of 1,235 receipt images that pairs GPT-4o-generated receipts with authentic ones from established datasets. Evaluation by five state-of-the-art multimodal LLMs and a 30-annotator crowdsourced perceptual study reveals a striking paradox: humans are better at seeing AI artifacts, yet worse at detecting AI documents.

Abstract

Can humans detect AI-generated financial documents better than machines? We present GPT4o-Receipt, a benchmark of 1,235 receipt images pairing GPT-4o-generated receipts with authentic ones from established datasets, evaluated by five state-of-the-art multimodal LLMs and a 30-annotator crowdsourced perceptual study. Our findings reveal a striking paradox: humans are better at seeing AI artifacts, yet worse at detecting AI documents. Human annotators exhibit the largest visual discrimination gap of any evaluator, yet their binary detection F1 falls well below Claude Sonnet 4 and below Gemini 2.5 Flash. This paradox resolves once the mechanism is understood: the dominant forensic signals in AI-generated receipts are arithmetic errors -- invisible to visual inspection but systematically verifiable by LLMs. Humans cannot perceive that a subtotal is incorrect; LLMs verify it in milliseconds. Beyond the human--LLM comparison, our five-model evaluation reveals dramatic performance disparities and calibration differences that render simple accuracy metrics insufficient for detector selection. GPT4o-Receipt, the evaluation framework, and all results are released publicly to support future research in AI document forensics.
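
To illustrate the kind of arithmetic verification the abstract refers to, here is a minimal sketch of a consistency check on an already-parsed receipt. It is not the paper's evaluation code: the function name `check_receipt`, the field layout (line items as quantity and unit-price pairs, plus subtotal, tax, and total), and the one-cent tolerance are all assumptions made for illustration.

```python
from decimal import Decimal


def check_receipt(items, subtotal, tax, total, tol=Decimal("0.01")):
    """Return a list of arithmetic inconsistencies in a parsed receipt.

    items: list of (quantity, unit_price) pairs (hypothetical layout).
    subtotal, tax, total: amounts as strings, e.g. "11.25".
    tol: allowed rounding slack (assumed to be one cent).
    """
    issues = []
    subtotal, tax, total = Decimal(subtotal), Decimal(tax), Decimal(total)

    # Check 1: line items must add up to the printed subtotal.
    line_sum = sum(
        (Decimal(qty) * Decimal(price) for qty, price in items),
        Decimal("0"),
    )
    if abs(line_sum - subtotal) > tol:
        issues.append("subtotal_mismatch")

    # Check 2: subtotal plus tax must equal the printed total.
    if abs(subtotal + tax - total) > tol:
        issues.append("total_mismatch")

    return issues


items = [(2, "3.50"), (1, "4.25")]  # line items sum to 11.25
consistent = check_receipt(items, "11.25", "0.90", "12.15")
inconsistent = check_receipt(items, "12.25", "0.90", "13.15")
```

In this made-up example the second receipt looks plausible at a glance, since its total agrees with its subtotal and tax, but the subtotal does not match the line items. That is the sort of error that visual inspection tends to miss and a mechanical check catches.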

Paper Structure (43 sections, 7 figures, 7 tables)

This paper contains 43 sections, 7 figures, 7 tables.

Introduction
The Rise of AI-Generated Documents
Receipts as a Multi-Dimensional Benchmark Domain
The Human--LLM Detection Gap
Contributions
Related Work
Generative Models and Numerical Hallucination
Image and Document Forgery Detection
Receipt and Financial Document Forensics
Multimodal LLMs as Forensic Detectors
The GPT4o-Receipt Dataset
AI-Generated Receipt Collection
Stage 1: Textual Receipt Generation
Stage 2: Photorealistic Image Rendering
Generation Characteristics and Known Artifacts
...and 28 more sections

Figures (7)

Figure 1: Recall vs. false positive rate for each detector. Upper-left is better. Claude Sonnet 4 achieves the highest overall detection (F1 = 0.975); Gemini 2.5 Flash exhibits the best calibration among effective detectors (lowest FPR = 0.023); Grok 4 reaches near-perfect recall at a 90.3% FPR; LLaMA 4 Scout has the lowest FPR but misses 89% of AI receipts. Human annotators (star) occupy a mid-tier position with moderate recall and low FPR.
Figure 2: Representative samples from GPT4o-Receipt. Top row: AI-generated receipts produced by the two-stage pipeline (GPT-4o text $\rightarrow$ GPT-Image-1 rendering); Bottom row: authentic receipts from ExpressExpense and Roboflow. AI-generated receipts exhibit high visual plausibility---realistic fonts, plausible merchant layouts, paper texture---but contain systematic arithmetic errors invisible to casual inspection.
Figure 3: Detection performance of five multimodal LLMs on GPT4o-Receipt. FPR (hatched, $\downarrow$ better); Accuracy, F1, Recall ($\uparrow$ better).
Figure 4: Failure rates (%) for each error category across AI-generated receipts, as assessed by each detector model. Darker red indicates higher failure rates. LLaMA 4 Scout's near-zero error detection rates are consistent with its overall failure to identify AI-generated receipts.
Figure 5: Mean visual realism scores ($\pm$ SD) for AI-generated and real receipts across all evaluators. Significance: $***$ $p < 0.001$; $*$ $p < 0.05$; ns, not significant. Humans exhibit the largest AI-vs-real gap (1.87 points, 95% CI [1.74, 1.99]).
...and 2 more figures
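
The figure captions above report recall, false positive rate (FPR), accuracy, and F1 for each detector. As a reminder of how these quantities relate, the following sketch computes them from confusion-matrix counts using the standard definitions, with "AI-generated" as the positive class. The function name `detection_metrics` is hypothetical, and the counts in the usage lines are invented for illustration, not taken from the paper.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute standard binary detection metrics from confusion counts.

    tp: AI receipts flagged as AI      fn: AI receipts missed
    fp: real receipts flagged as AI    tn: real receipts passed as real
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "recall": recall,
        "fpr": fpr,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


# Invented counts: a balanced detector versus one that flags almost everything.
balanced = detection_metrics(tp=90, fp=10, tn=90, fn=10)
flag_all = detection_metrics(tp=100, fp=90, tn=10, fn=0)
```

The second set of counts shows why recall alone can mislead: a detector that labels nearly every receipt as AI-generated reaches perfect recall while its FPR is very high, which is why the captions report both quantities together.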
