BERT-Enhanced Retrieval Tool for Homework Plagiarism Detection System
Jiarong Xian, Jibao Yuan, Peiwei Zheng, Dexian Chen, Nie Yuntao
TL;DR
The paper tackles the scarcity of high-quality plagiarism datasets by using GPT-3.5 to generate a large, diverse corpus of labeled plagiarized text pairs. It proposes a two-stage detection system: SBERT-based embeddings indexed with Faiss for fast retrieval, followed by an MLP classifier for multiclass plagiarism detection. The proposed SBERT-based model achieves state-of-the-art performance on the authors' own dataset, notably 98.86% accuracy. The approach integrates data augmentation, semantic representations, and efficient retrieval to enable accurate and scalable plagiarism analysis, and it ships with a user-friendly demo platform for practical use. The work also discusses the limitations of SBERT in distinguishing degrees of plagiarism and provides a detailed appendix of the prompts used to produce the training data.
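The retrieve-then-classify flow summarized above can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: `embed` is a hashed bag-of-words stand-in for an SBERT encoder, `search` is a brute-force inner-product scan standing in for a Faiss index, and `classify` is a threshold rule standing in for the MLP classifier. The label names and thresholds are hypothetical and do not come from the paper.

```python
import hashlib
import math
from collections import Counter

DIM = 64  # toy embedding size (assumption); real SBERT vectors are much larger


def embed(text):
    """Hypothetical stand-in for an SBERT encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token, count in Counter(text.lower().split()).items():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def search(index, query_vec, k=2):
    """Stage 1: brute-force inner-product search (the role a Faiss index would play).

    Returns the k best (score, document_position) pairs, highest score first.
    On unit-length vectors the inner product equals cosine similarity.
    """
    scored = [
        (sum(a * b for a, b in zip(doc_vec, query_vec)), position)
        for position, doc_vec in enumerate(index)
    ]
    return sorted(scored, reverse=True)[:k]


def classify(score):
    """Stage 2: placeholder for the MLP classifier; thresholds and labels are illustrative."""
    if score >= 0.9:
        return "likely copied"
    if score >= 0.5:
        return "possibly paraphrased"
    return "no match"


library = [
    "the mitochondria is the powerhouse of the cell",
    "photosynthesis converts light into chemical energy",
    "the french revolution began in 1789",
]
index = [embed(doc) for doc in library]

query = "the mitochondria is the powerhouse of the cell"
for score, position in search(index, embed(query), k=2):
    print(f"{score:.3f}  {classify(score):<22}  {library[position]}")
```

The split mirrors the design described in the summary: a cheap similarity search narrows the library to a few candidates, and only those candidates are passed to the more expensive classification step.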
Abstract
Text plagiarism detection is a common natural language processing task that aims to determine whether a given text contains content plagiarized or copied from other texts. In existing research, detecting high-level plagiarism remains a challenge due to the lack of high-quality datasets. In this paper, we propose a GPT-3.5-based method for generating plagiarized text data, which produces a plagiarism detection dataset of 32,927 text pairs covering a wide range of plagiarism methods and fills this gap in the research. We also propose an efficient and accurate plagiarism identification method based on Faiss and BERT. Our experiments show that this model outperforms other models on several metrics, achieving 98.86%, 98.90%, 98.86%, and 0.9888 for Accuracy, Precision, Recall, and F1 Score, respectively. Finally, we provide a user-friendly demo platform that allows users to upload a text library and take part in the plagiarism analysis intuitively.
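As a quick consistency check on the reported metrics, the sketch below recomputes F1 from the stated Precision and Recall. It assumes the reported F1 is the harmonic mean of those two figures; the abstract does not say how the multiclass scores were averaged, so this is a sanity check rather than a reproduction of the authors' evaluation. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)


# Figures taken from the abstract, expressed as fractions.
precision, recall = 0.9890, 0.9886
f1 = f1_from(precision, recall)
print(f"F1 = {f1:.4f}")
```

Under that assumption the result agrees with the reported 0.9888 to four decimal places.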
