GPT4o-Receipt: A Dataset and Human Study for AI-Generated Document Forensics

Yan Zhang, Simiao Ren, Ankit Raj, En Wei, Dennis Ng, Alex Shen, Jiayue Xu, Yuxin Zhang, Evelyn Marotta

TL;DR

GPT4o-Receipt is a benchmark of 1,235 receipt images that pairs GPT-4o-generated receipts with authentic ones from established datasets. Five state-of-the-art multimodal LLMs and a 30-annotator crowdsourced perceptual study are evaluated on it, revealing a striking paradox: humans are better at seeing AI artifacts, yet worse at detecting AI documents.

Abstract

Can humans detect AI-generated financial documents better than machines? We present GPT4o-Receipt, a benchmark of 1,235 receipt images pairing GPT-4o-generated receipts with authentic ones from established datasets, evaluated by five state-of-the-art multimodal LLMs and a 30-annotator crowdsourced perceptual study. Our findings reveal a striking paradox: humans are better at seeing AI artifacts, yet worse at detecting AI documents. Human annotators exhibit the largest visual discrimination gap of any evaluator, yet their binary detection F1 falls well below Claude Sonnet 4 and below Gemini 2.5 Flash. This paradox resolves once the mechanism is understood: the dominant forensic signals in AI-generated receipts are arithmetic errors, which are invisible to visual inspection but systematically verifiable by LLMs. Humans cannot perceive that a subtotal is incorrect; LLMs verify it in milliseconds. Beyond the human-LLM comparison, our five-model evaluation reveals dramatic performance disparities and calibration differences that render simple accuracy metrics insufficient for detector selection. GPT4o-Receipt, the evaluation framework, and all results are released publicly to support future research in AI document forensics.
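
The arithmetic signal described in the abstract can be illustrated with a minimal sketch. It assumes the receipt fields have already been extracted as text; the function name `check_receipt`, its parameters, and the one-cent tolerance are hypothetical choices for illustration, not part of the released evaluation framework.

```python
from decimal import Decimal


def check_receipt(items, subtotal, tax, total, tol=Decimal("0.01")):
    """Return a list of arithmetic inconsistencies found on a receipt.

    items: list of (quantity, unit_price) pairs, prices given as strings.
    subtotal, tax, total: printed amounts as strings.
    All names and the tolerance are illustrative assumptions.
    """
    issues = []
    subtotal_d, tax_d, total_d = Decimal(subtotal), Decimal(tax), Decimal(total)

    # Check 1: line items should add up to the printed subtotal.
    line_sum = sum(
        (Decimal(qty) * Decimal(price) for qty, price in items), Decimal("0")
    )
    if abs(line_sum - subtotal_d) > tol:
        issues.append(f"line items sum to {line_sum}, subtotal says {subtotal_d}")

    # Check 2: subtotal plus tax should equal the printed total.
    if abs(subtotal_d + tax_d - total_d) > tol:
        issues.append(f"subtotal + tax is {subtotal_d + tax_d}, total says {total_d}")

    return issues


# Made-up example values, not drawn from the dataset.
items = [(2, "3.50"), (1, "4.25")]
consistent = check_receipt(items, "11.25", "0.90", "12.15")
inconsistent = check_receipt(items, "12.25", "0.90", "13.15")
print(consistent)
print(inconsistent)
```

A receipt can look entirely plausible and still fail the first check, which is the kind of error that visual inspection does not surface.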

Paper Structure

This paper contains 43 sections, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Recall vs. false positive rate for each detector. Upper-left is better. Claude Sonnet 4 achieves the highest overall detection (F1 = 0.975); Gemini 2.5 Flash exhibits the best calibration among effective detectors (lowest FPR = 0.023); Grok 4 reaches near-perfect recall at a 90.3% FPR; LLaMA 4 Scout has the lowest FPR but misses 89% of AI receipts. Human annotators (star) occupy a mid-tier position with moderate recall and low FPR.
  • Figure 2: Representative samples from GPT4o-Receipt. Top row: AI-generated receipts produced by the two-stage pipeline (GPT-4o text $\rightarrow$ GPT-Image-1 rendering). Bottom row: authentic receipts from ExpressExpense and Roboflow. AI-generated receipts exhibit high visual plausibility (realistic fonts, plausible merchant layouts, paper texture) but contain systematic arithmetic errors invisible to casual inspection.
  • Figure 3: Detection performance of five multimodal LLMs on GPT4o-Receipt. FPR (hatched, $\downarrow$ better); Accuracy, F1, Recall ($\uparrow$ better).
  • Figure 4: Failure rates (%) for each error category across AI-generated receipts, as assessed by each detector model. Darker red indicates higher failure rates. LLaMA 4 Scout's near-zero error detection rates are consistent with its overall failure to identify AI-generated receipts.
  • Figure 5: Mean visual realism scores ($\pm$ SD) for AI-generated and real receipts across all evaluators. Significance: $***$ denotes $p < 0.001$; $*$ denotes $p < 0.05$; ns denotes not significant. Humans exhibit the largest AI-vs-real gap (1.87 points, 95% CI [1.74, 1.99]).
  • ...and 2 more figures
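
The detector metrics named in the figure captions (recall, FPR, accuracy, F1) follow from a binary confusion matrix in which "AI-generated" is the positive class. The sketch below uses the standard definitions. The function name `detection_metrics` and the counts are made up for illustration and are not the paper's results.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute standard binary detection metrics from confusion counts.

    tp: AI receipts flagged as AI      fn: AI receipts missed
    fp: real receipts flagged as AI    tn: real receipts passed as real
    """
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "recall": recall,
        "precision": precision,
        "fpr": fpr,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts: a detector that flags nearly everything as AI.
over_flagging = detection_metrics(tp=100, fp=90, tn=10, fn=0)
# Hypothetical counts: a better-calibrated detector.
calibrated = detection_metrics(tp=90, fp=5, tn=95, fn=10)
print(over_flagging)
print(calibrated)
```

The first hypothetical detector reaches perfect recall only by also flagging most real receipts, which is why recall or accuracy alone says little about calibration and why FPR is reported alongside them.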