THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics

Tzu-Yen Ma, Bo Zhang, Zichen Tang, Junpeng Ding, Haolin Tian, Yuanze Li, Zhuodi Hao, Zixin Ding, Zirui Wang, Xinyu Yu, Shiyao Peng, Yizhuo Zhao, Ruomeng Jiang, Yiling Huang, Peizhi Zhao, Jiayuan Chen, Weisheng Tan, Haocheng Gao, Yang Liu, Jiacheng Liu, Zhongjun Yang, Jiayu Huang, Haihong E

Abstract

We present THEMIS, a novel multi-task benchmark designed to comprehensively evaluate multimodal large language models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advances. (1) Real-World Scenarios and Complexity: Our benchmark comprises over 4,000 questions spanning seven scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 60.47% complex-texture images, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) Fraud-Type Diversity and Granularity: THEMIS systematically covers five challenging fraud types and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, with the diversity and difficulty of these manipulations demanding a high level of visual fraud reasoning from the models. (3) Multi-Dimensional Capability Evaluation: We establish a mapping from fraud types to five core visual fraud reasoning capabilities, thereby enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these core capabilities. Experiments on 16 leading MLLMs show that even the best-performing model, GPT-5, achieves an overall performance of only 56.15%, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud reasoning tasks.

Paper Structure

This paper contains 16 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of THEMIS. (1) Real-World Scenarios and Complexity, comprising over 4,000 questions across 7 representative scenarios; (2) Fraud-Type Diversity and Granularity, covering 5 challenging fraud methods with 16 fine-grained manipulation operations; (3) Multi-Dimensional Capability Evaluation, mapping fraud types to 5 core reasoning capabilities.
  • Figure 2: Statistics of THEMIS. (a) Distribution of fraud methods. (b) Distribution of manipulation operations (synthetic data). See Table \ref{tab:right_table} for real cases. IIF: Image Inference Forgery; TRE: Targeted Region Editing; CT: Color Temperature; DR: Direct Reuse; HF: Horizontal Flip; VF: Vertical Flip.
  • Figure 3: Dataset construction pipeline of THEMIS. The dataset is built in 2 stages. Stage 1, Extraction and Parsing: figures, captions, and related sentences are parsed from scientific PDFs and segmented into panels. Stage 2, Fraud Data Generation: 5 major fraud types (Splicing, Copy-Move, AI-Generated, Duplication, and Text–Image Inconsistency) are applied to construct challenging tasks.
  • Figure 4: Evaluation task design of THEMIS. A principled mapping from 5 fraud types to 5 core reasoning capabilities (Expert Knowledge Utilization, Visual Recognition, Spatial Reasoning, Region Localization, and Comparative Reasoning). The capability distribution bars on the right of each box illustrate the reasoning skills involved and their relative emphasis, with the darkest color highlighting the primary capability being evaluated.
  • Figure 5: Impact of different factors on model performance using synthetic data. SPL: Splicing; CM: Copy-Move; AIG: AI-Generated; DUP: Duplication; DUP-Id: Id Score for Duplication; TII-Id: Id Score for Text–Image Inconsistency; TII-Loc: Loc Score for Text–Image Inconsistency; SMF-Id: Average Id Score of Single-Mode Forgery Identification; SMF-Loc: Average Loc Score of Single-Mode Forgery Localization.
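
The abstract notes that each sample undergoes multiple stacked manipulation operations, and the Figure 2 caption names Direct Reuse (DR), Horizontal Flip (HF), and Vertical Flip (VF) among them. As a rough illustration of what "stacking" could mean, the sketch below chains such operations on a toy image panel. It is not the authors' pipeline: the nested-list image representation, the function names, and the `apply_stack` helper are all assumptions made for this example.

```python
from typing import Callable, List

# Assumption: a toy grayscale panel stored as rows of pixel values.
Image = List[List[int]]


def direct_reuse(img: Image) -> Image:
    """DR: return an unmodified copy of the panel."""
    return [row[:] for row in img]


def horizontal_flip(img: Image) -> Image:
    """HF: mirror each row left-to-right."""
    return [row[::-1] for row in img]


def vertical_flip(img: Image) -> Image:
    """VF: reverse the row order (top-to-bottom mirror)."""
    return [row[:] for row in img[::-1]]


def apply_stack(img: Image, ops: List[Callable[[Image], Image]]) -> Image:
    """Hypothetical helper: apply the operations in order to a copy of img."""
    out = direct_reuse(img)
    for op in ops:
        out = op(out)
    return out


panel = [[1, 2, 3],
         [4, 5, 6]]

# Stack HF then VF; together they amount to a 180-degree rotation.
duplicate = apply_stack(panel, [horizontal_flip, vertical_flip])
print(duplicate)  # [[6, 5, 4], [3, 2, 1]]
```

Even in this toy case the stacked result no longer matches the source panel pixel-for-pixel, which hints at why detecting duplicated content after several operations calls for comparative and spatial reasoning rather than exact matching.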