Overview of the Plagiarism Detection Task at PAN 2025
André Greiner-Petter, Maik Fröbe, Jan Philip Wahle, Terry Ruas, Bela Gipp, Akiko Aizawa, Martin Potthast
TL;DR
This paper revives the PAN plagiarism detection task in the era of generative AI by creating a large-scale, automatically generated plagiarism dataset built from $S$–$P$ document pairs, where passages in $P$ are paraphrased from semantically similar sources $S$ using three LLMs. The authors adopt a retrieval-oriented evaluation framework, combine semantic, lexical, and structural cues to generate high-quality alignments, and benchmark four participant systems against four baselines. The comparison highlights notable gains from specialized text-retrieval models such as Linq-Embed-Mistral. Key findings are that embedding-based detectors reach a recall of up to about $0.8$ at a precision of about $0.5$, and that generalization to older PAN datasets remains challenging, which suggests a need for more diverse data generation and more robust retrieval strategies. The work underscores how plagiarism detection is changing as LLMs advance and argues for broader domains, more realistic generation, and citation-awareness to keep the task relevant in practice.
Abstract
The generative plagiarism detection task at PAN 2025 aims to identify automatically generated textual plagiarism in scientific articles and to align the plagiarized passages with their respective sources. We created a novel large-scale dataset of automatically generated plagiarism using three large language models: Llama, DeepSeek-R1, and Mistral. In this task overview paper, we outline the creation of this dataset, summarize and compare the results of all participants and four baselines, and evaluate the submitted approaches on the dataset of the most recent previous plagiarism detection task, held at PAN 2015, to assess their robustness. We found that the current iteration does not invite a large variety of approaches, as naive semantic similarity approaches based on embedding vectors already provide promising results of up to 0.8 recall and 0.5 precision. However, most of these approaches underperform significantly on the 2015 dataset, indicating a lack of generalizability.
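To make the notion of a naive embedding-based semantic similarity approach concrete, the following is a minimal sketch and not the implementation of any participant or baseline. It assumes a toy bag-of-words vector as a stand-in for a real embedding model, a hypothetical similarity threshold of 0.5, and a simplified pair-level precision and recall; the measures actually used in the task may be defined differently. All function names and example passages are illustrative.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model (assumption): word-count vector."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[token] for token, count in u.items() if token in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def detect(suspicious, sources, threshold=0.5):
    """Return all (suspicious_idx, source_idx) pairs at or above the threshold."""
    source_vectors = [embed(s) for s in sources]
    pairs = set()
    for i, passage in enumerate(suspicious):
        vector = embed(passage)
        for j, source_vector in enumerate(source_vectors):
            if cosine(vector, source_vector) >= threshold:
                pairs.add((i, j))
    return pairs


def precision_recall(predicted, gold):
    """Pair-level precision and recall (a simplification, see lead-in)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: passages 0 and 2 are paraphrases of a source,
# passage 3 is merely similar to source 0 and is not labeled as plagiarism.
sources = [
    "the cat sat on the mat",
    "neural networks learn representations",
]
suspicious = [
    "the cat sat on a mat",
    "stock prices fell sharply today",
    "neural networks learn useful representations",
    "the dog sat on the mat",
]
gold = {(0, 0), (2, 1)}

predicted = detect(suspicious, sources)
precision, recall = precision_recall(predicted, gold)
print(sorted(predicted), precision, recall)
```

In this toy setting the detector recovers both labeled pairs but also flags the merely similar passage, so recall is higher than precision. This is the same qualitative pattern as the reported 0.8 recall and 0.5 precision, although the numbers here carry no empirical meaning.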
