Papilusion at DAGPap24: Paper or Illusion? Detecting AI-generated Scientific Papers

Nikita Andreev; Alexander Shirnin; Vladislav Mikhailov; Ekaterina Artemova

TL;DR

Papilusion tackles token-level detection of AI-generated scientific text in the DAGPap24 challenge. It uses an ensemble of independently fine-tuned encoder models and applies majority voting to label each token as human-written or machine-generated. The system scored 89.83 F1 during the competition; after a tokenization bug was corrected and the hyperparameters were refined, it reaches 99.46 F1 on the official test set, underscoring the impact of data handling and ensemble design. The findings show that even small to mid-sized DeBERTa variants can perform strongly under practical constraints, while data-generation artifacts such as synonym replacement can introduce distribution shifts and affect task difficulty. Overall, Papilusion demonstrates robust token-level detection and highlights practical considerations for deploying AI-generated text detectors in scientific contexts.

Abstract

This paper presents Papilusion, an AI-generated scientific text detector developed within the DAGPap24 shared task on detecting automatically generated scientific papers. We propose an ensemble-based approach and conduct ablation studies to analyze how the detector configurations affect performance. Papilusion ranked 6th on the leaderboard; after the competition ended, we improved its performance, achieving an F1-score of 99.46 (+9.63) on the official test set.

Paper Structure (14 sections, 1 figure, 4 tables)


Introduction
Background
Task Formulation
Performance Metric
Papilusion
Experiments
Overview
Competition-included experiments
Post-competition study
Hardware specification
Results
Competition phase
Post-competition phase
Conclusion

Figures (1)

Figure 1: The Papilusion pipeline involves fine-tuning three distinct encoder models, which are based on the same architecture but trained independently with different hyperparameters. These models use linear heads to predict labels that differentiate between human-written and machine-generated text. Finally, a majority vote is applied to aggregate the predicted labels.
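The aggregation step in the caption can be illustrated with a minimal sketch. It assumes each model has already produced one integer label per token for the same token sequence. The function name `majority_vote`, the integer label encoding, and the tie-breaking rule (fall back to the earliest model in the list) are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Aggregate per-token labels from several models by majority vote.

    `predictions[m][t]` is the label that model `m` assigns to token `t`.
    The label encoding (e.g. 0 = human-written, 1 = machine-generated)
    is an assumption made for this sketch.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same tokens")

    voted = []
    for token_labels in zip(*predictions):
        counts = Counter(token_labels)
        top = max(counts.values())
        # Assumed tie-break: take the label of the earliest model
        # whose label reaches the highest vote count.
        voted.append(next(l for l in token_labels if counts[l] == top))
    return voted


if __name__ == "__main__":
    # Three hypothetical models labelling the same four tokens.
    model_a = [0, 1, 1, 0]
    model_b = [0, 0, 1, 0]
    model_c = [1, 1, 1, 0]
    print(majority_vote([model_a, model_b, model_c]))
```

With three voters and two labels a tie cannot occur, so the tie-break only matters if the sketch is reused with more labels or an even number of models.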
