Papilusion at DAGPap24: Paper or Illusion? Detecting AI-generated Scientific Papers

Nikita Andreev, Alexander Shirnin, Vladislav Mikhailov, Ekaterina Artemova

TL;DR

Papilusion addresses token-level detection of AI-generated scientific text in the DAGPap24 challenge. It uses an ensemble of independently fine-tuned encoder models and applies majority voting to label each token as human-written or machine-generated. The system scored 89.83 F1 during the competition; after the authors corrected a tokenization bug and refined hyperparameters, it reached 99.46 F1 on the official test set, which underscores the impact of data handling and ensemble design. The findings show that even small to mid-sized DeBERTa variants perform strongly under practical constraints, while data-generation artifacts such as synonym replacement can shift the data distribution and change the difficulty of the task. Overall, Papilusion demonstrates robust token-level detection and highlights practical considerations for deploying AI-generated text detectors in scientific contexts.

Abstract

This paper presents Papilusion, an AI-generated scientific text detector developed within the DAGPap24 shared task on detecting automatically generated scientific papers. We propose an ensemble-based approach and conduct ablation studies to analyze how the detector configuration affects performance. Papilusion ranked 6th on the leaderboard, and we improved our performance after the competition ended, achieving an F1-score of 99.46 (+9.63) on the official test set.

Paper Structure (14 sections, 1 figure, 4 tables)

Figures (1)

  • Figure 1: The Papilusion pipeline involves fine-tuning three distinct encoder models, which are based on the same architecture but trained independently with different hyperparameters. These models use linear heads to predict labels that differentiate between human-written and machine-generated text. Finally, a majority vote is applied to aggregate the predicted labels.
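
The aggregation step described in Figure 1 can be illustrated with a minimal sketch. It assumes each of the three fine-tuned models has already produced one label per token; the function name `majority_vote` and the label strings `"human"` and `"machine"` are hypothetical and are not taken from the paper or its code.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Aggregate per-token labels from several models by majority vote.

    `predictions` holds one label sequence per model; all sequences must
    cover the same tokens in the same order. (Hypothetical helper.)
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same number of tokens")

    voted = []
    # zip(*predictions) yields, for each token, the labels from every model.
    for token_labels in zip(*predictions):
        # most_common(1) returns the most frequent label; on a tie it keeps
        # the label seen first, i.e. the earliest model in `predictions`.
        label, _count = Counter(token_labels).most_common(1)[0]
        voted.append(label)
    return voted


# Illustrative labels for four tokens from three hypothetical models.
model_a = ["human", "human", "machine", "machine"]
model_b = ["human", "machine", "machine", "machine"]
model_c = ["human", "human", "human", "machine"]

ensemble_labels = majority_vote([model_a, model_b, model_c])
print(ensemble_labels)
```

With three voters and two labels a tie cannot occur, so the tie-breaking rule only matters if the label set or the number of models changes.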