PlagBench: Exploring the Duality of Large Language Models in Plagiarism Generation and Detection

Jooyoung Lee, Toshini Agrawal, Adaku Uchendu, Thai Le, Jinghui Chen, Dongwon Lee

TL;DR

PlagBench addresses the dual challenge of LLM-driven plagiarism generation and detection by introducing a 46.5K paired-text benchmark covering verbatim, paraphrase, and summary plagiarism across three domains. The authors combine automatic and human quality assurance to curate high-quality generation samples from three LLMs, and evaluate both LLM-based and traditional detectors under diverse prompting strategies. Key findings show that GPT-3.5 Turbo often excels at paraphrase and summarization quality, while GPT-4 Turbo leads in detection performance, with several LLMs surpassing commercial detectors under few-shot CoT prompts. The dataset and accompanying code constitute a rich resource for developing and benchmarking robust, domain-aware plagiarism detection systems, though detecting summary plagiarism remains notably challenging. This work highlights the evolving capabilities of LLMs in both creating and identifying plagiarized content and sets a standard for rigorous evaluation in this area.
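
For readers who want to poke at the benchmark itself, here is a minimal loading sketch in Python. It assumes, purely for illustration, that the released data is a JSONL file whose records carry `source`, `rewrite`, `plagiarism_type`, and `generator` fields; the actual file names and schema should be taken from the GitHub repository linked in the abstract below.

```python
import json
from collections import Counter

def load_pairs(path):
    """Load PlagBench-style (source, rewrite) pairs from a JSONL file.

    The file name and field names here are hypothetical placeholders;
    see https://github.com/Brit7777/plagbench for the real schema.
    """
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

pairs = load_pairs("plagbench.jsonl")  # hypothetical file name
print(f"{len(pairs)} pairs loaded")
# Tally samples per plagiarism type (verbatim / paraphrase / summary).
print(Counter(p["plagiarism_type"] for p in pairs))
```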

Abstract

Recent studies have raised concerns about the potential threats large language models (LLMs) pose to academic integrity and copyright protection. Yet, these investigations have predominantly focused on literal copies of original texts, and how LLMs can facilitate the detection of LLM-generated plagiarism remains largely unexplored. To address these gaps, we introduce PlagBench, a dataset of 46.5K synthetic text pairs that represent three major types of plagiarism: verbatim copying, paraphrasing, and summarization. These samples are generated by three advanced LLMs. We rigorously validate the quality of PlagBench through a combination of fine-grained automatic evaluation and human annotation. We then utilize this dataset for two purposes: (1) to examine LLMs' ability to transform original content into accurate paraphrases and summaries, and (2) to evaluate the plagiarism detection performance of five modern LLMs alongside three specialized plagiarism checkers. Our results show that GPT-3.5 Turbo can produce high-quality paraphrases and summaries without significantly increasing text complexity compared to GPT-4 Turbo. However, in terms of detection, GPT-4 outperforms other LLMs and commercial detection tools by 20%, highlighting the evolving capabilities of LLMs not only in content generation but also in plagiarism detection. Data and source code are available at https://github.com/Brit7777/plagbench.
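
To make the detection setup concrete, the sketch below frames binary plagiarism detection as a single zero-shot query to an LLM via the OpenAI Python client. The prompt wording is our own illustration rather than the exact prompt used in the paper, and the model name is just one plausible choice.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative zero-shot prompt; the paper also explores few-shot and
# chain-of-thought variants, which are not reproduced here.
PROMPT = (
    "You are a plagiarism detector. Given a source text and a candidate "
    "text, answer 'plagiarism' if the candidate is a verbatim copy, "
    "paraphrase, or summary of the source, and 'no plagiarism' otherwise.\n\n"
    "Source: {source}\n\nCandidate: {candidate}\n\nAnswer:"
)

def detect(source: str, candidate: str, model: str = "gpt-4-turbo") -> str:
    """Return the model's binary plagiarism verdict for one text pair."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(source=source, candidate=candidate)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```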


Paper Structure

This paper contains 17 sections, 2 equations, 4 figures, and 9 tables.

Figures (4)

  • Figure 1: The overview of PlagBench construction processes and proposed RQs. Blue and red arrows denote the flow of RQ1 and RQ2, respectively.
  • Figure 2: Binary plagiarism detection (no plagiarism vs. plagiarism) performance of 5 LLMs w.r.t. prompt types. Dotted lines represent the performance of non-LLM-based detectors.
  • Figure 3: Mean paraphrase evaluation aspect scores w.r.t. domain and model types (automatic evaluation). For BERTScore and AlignScore, a higher score indicates greater semantic similarity and consistency between the LLM-paraphrased text and the source text. A higher readability improvement score suggests that the LLM-paraphrased text is more complex and requires a higher level of education to understand than the source text.
  • Figure 4: Mean summary evaluation aspect scores w.r.t. domain and model types (automatic evaluation). A lower BARTScore indicates greater coherence between the source text and the LLM-summarized text. For AlignScore and BLANC, a higher score indicates greater consistency and relevancy between the LLM-summarized text and the source text. A higher readability score suggests that the LLM-summarized text is more complex and requires a higher level of education to understand than the source text.
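
To give a feel for how the generation-quality axes in Figures 3 and 4 can be computed, here is a small sketch that scores one (source, rewrite) pair with BERTScore for semantic similarity and a Flesch-Kincaid grade-level delta as a proxy for the readability-change axis. Using the `bert-score` and `textstat` packages is our assumption; the paper's AlignScore, BARTScore, and BLANC metrics have their own reference implementations and are omitted here for brevity.

```python
# pip install bert-score textstat
from bert_score import score as bert_score
import textstat

def evaluate_pair(source: str, rewrite: str) -> dict:
    """Score a (source, rewrite) pair on two axes akin to Figure 3.

    BERTScore F1 approximates semantic similarity between the rewrite and
    the source; the grade-level delta approximates 'readability improvement'
    (positive = the rewrite demands more education to read than the source).
    """
    _, _, f1 = bert_score([rewrite], [source], lang="en")
    grade_delta = (textstat.flesch_kincaid_grade(rewrite)
                   - textstat.flesch_kincaid_grade(source))
    return {"bertscore_f1": float(f1[0]), "readability_delta": grade_delta}

print(evaluate_pair(
    "The quick brown fox jumps over the lazy dog.",
    "A fast, brown fox leaps over a sleepy dog.",
))
```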