Paraphrase Identification with Deep Learning: A Review of Datasets and Methods

Chao Zhou, Cheng Qiu, Lizhen Liang, Daniel E. Acuna

TL;DR

This paper introduces and validates ReParaphrased, a refined paraphrase typology, and extends the Extended Typology Paraphrase Corpus with meticulous manual annotations to improve reliability. It shows how the under-representation of certain paraphrase types in widely used datasets, including those used to train Large Language Models (LLMs), undermines plagiarism detection accuracy.

Abstract

The rapid progress of Natural Language Processing (NLP) technologies has led to the widespread availability and effectiveness of text generation tools such as ChatGPT and Claude. While highly useful, these technologies also pose significant risks to the credibility of various media forms if they are employed for paraphrased plagiarism -- one of the most subtle forms of content misuse in scientific literature and general text media. Although automated methods for paraphrase identification have been developed, detecting this type of plagiarism remains challenging due to the inconsistent nature of the datasets used to train these methods. In this article, we examine traditional and contemporary approaches to paraphrase identification, investigating how the under-representation of certain paraphrase types in popular datasets, including those used to train Large Language Models (LLMs), affects the ability to detect plagiarism. We introduce and validate a new refined typology for paraphrases (ReParaphrased, REfined PARAPHRASE typology definitions) to better understand the disparities in paraphrase type representation. Lastly, we propose new directions for future research and dataset development to enhance AI-based paraphrase detection.

Paper Structure (57 sections, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Traditional approaches and techniques for paraphrase identification are mainly classified as knowledge-based or corpus-based. Each category has various branches; we review the three most used in each.
  • Figure 2: Early Architectures consist of simple embedding approaches and shallow neural networks on PI tasks. Traditional deep neural networks (TDNNs) contain two mainstream neural networks: CNNs and RNNs, and their improvements on PI tasks. Mechanism Modules include substantial independent improvements on PI tasks. Transformer-based structures incorporate modern transformers on PI or downstream tasks.
  • Figure 3: The tensor layer and k-Max pooling mechanism in Multipossen2016. This figure is reproduced from Multipossen2016.
  • Figure 4: Illustration of the proposed Co-Stack Residual Affinity Network (CSRAN) architecture (tay2018co). Each color-coded matrix represents the interactions between two layers of sequence A and sequence B. This figure is reproduced from tay2018co.
  • Figure 5: The hybrid deep neural architecture for robust paraphrase identification by agarwal_deep_2018. The architecture combines CNNs and LSTMs. This figure is reproduced from agarwal_deep_2018.
  • ...and 1 more figures