PARAPHRASUS: A Comprehensive Benchmark for Evaluating Paraphrase Detection Models

Andrianos Michail; Simon Clematide; Juri Opitz

TL;DR

PARAPHRASUS addresses the inadequacy of single-dataset evaluation for paraphrase detection by introducing a multi-dimensional benchmark that spans 10 datasets across 3 objectives and 3 challenges. The approach combines repurposed NLP task data with two novel paraphrase resources and a careful prompting protocol to probe both trained models and LLMs under zero-shot and in-context learning scenarios, with an emphasis on different notions of paraphrase. Key contributions include the construction of two novel datasets (338 STS-H paraphrase annotations and AMR-guided paraphrase pairs), a robust unweighted scoring metric denoted as $\overline{Err}$, and extensive analyses including ablations and human agreement studies. Findings show that no model consistently excels across all paraphrase phenomena, that training on a single dataset can hinder generalization, and that prompting and data design critically shape performance; the benchmark thus provides a publicly available framework for fair model comparison and future extensions.
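As a rough illustration of what an unweighted score such as $\overline{Err}$ could look like, the sketch below computes one error rate per dataset and then takes a plain mean, so that every dataset counts equally regardless of its size. The per-objective error definitions (misclassification rate for "classify", positive-prediction rate for "minimize", and its complement for "maximize") are assumptions inferred from the objective names, and the function names are hypothetical; this is not the benchmark library's API.

```python
from statistics import mean


def dataset_error(objective, pred, gold=None):
    """Error in [0, 1] for one dataset (assumed definitions, not the paper's code).

    objective: "classify", "minimize", or "maximize" (positive predictions).
    pred:      list of 0/1 predictions (1 = predicted paraphrase).
    gold:      list of 0/1 gold labels, required only for "classify".
    """
    if not pred:
        raise ValueError("pred must not be empty")
    positive_rate = sum(pred) / len(pred)
    if objective == "classify":
        if gold is None or len(gold) != len(pred):
            raise ValueError("classify needs gold labels of the same length")
        # Fraction of predictions that disagree with the gold label.
        return sum(g != p for g, p in zip(gold, pred)) / len(pred)
    if objective == "minimize":
        # Assumption: every positive prediction counts as an error.
        return positive_rate
    if objective == "maximize":
        # Assumption: every negative prediction counts as an error.
        return 1.0 - positive_rate
    raise ValueError(f"unknown objective: {objective!r}")


def unweighted_mean_error(errors_by_dataset):
    """Plain mean over datasets: each dataset has equal weight."""
    return mean(errors_by_dataset.values())


# Toy example with hypothetical dataset names and made-up predictions.
errors = {
    "toy_classify": dataset_error("classify", [1, 1, 1, 0], gold=[1, 0, 1, 0]),
    "toy_minimize": dataset_error("minimize", [0, 0, 1, 0]),
    "toy_maximize": dataset_error("maximize", [1, 1, 1, 0]),
}
overall = unweighted_mean_error(errors)
```

Averaging per-dataset errors rather than pooling all examples keeps a large dataset from dominating the overall score, which is the usual motivation for an unweighted mean.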

Abstract

The task of determining whether two texts are paraphrases has long been a challenge in NLP. However, the prevailing notion of paraphrase is often quite simplistic, offering only a limited view of the vast spectrum of paraphrase phenomena. Indeed, we find that evaluating models on a single paraphrase dataset can leave uncertainty about their true semantic understanding. To alleviate this, we create PARAPHRASUS, a benchmark designed for multi-dimensional assessment, benchmarking, and selection of paraphrase detection models. We find that paraphrase detection models under our fine-grained evaluation lens exhibit trade-offs that cannot be captured through a single classification dataset. Furthermore, PARAPHRASUS allows prompt calibration for different use cases, tailoring LLMs to specific strictness levels. PARAPHRASUS includes 3 challenges spanning 10 datasets, of which 8 are repurposed and 2 are newly annotated; we release it along with a benchmarking library at https://github.com/impresso/paraphrasus

Paper Structure (31 sections, 1 equation, 6 figures, 7 tables)

Introduction
Related Work
Paraphrasing.
Paraphrase Datasets
Creating benchmarks.
Semantic similarity datasets.
Proposed Benchmark
The data: Ten Parts with Three Objectives
Classify!
Minimize positive predictions!
Maximize positive predictions!
Evaluation Metric
Testing Paraphrase Detection Models
What can we learn from PAWS-X?
What can we learn from LLMs?
...and 16 more sections

Figures (6)

Figure 1: Percentage of paraphrases predicted on the Semantic Textual Similarity Benchmark (STS-B) dataset (Cer et al., 2017), binned by scores from 0 (completely dissimilar) to 5 (completely equivalent). The human annotation comes from the STS-H annotation we perform.
Figure 2: For P1, P2, and P3, the paraphrase notions we ask for are "paraphrases", "semantically equivalent", and "expressing the same content", respectively. For the expanded ICL prompt, see the appendix on prompts.
Figure 3: Cohen's $\kappa$ between humans and systems when annotating the STS-H dataset, which consists of highly similar (STS score 4-5) sentences.
Figure 4: In-Context-Learning Prompt Template. For P1, P2, and P3, the questions asked are "paraphrases," "semantically equivalent," and "expressing the same content," respectively.
Figure 5: Average Word Position Deviation (WPD) and Lexical Diversity (LD) (Liu and Soh, 2022) of the symmetric datasets of PARAPHRASUS.
...and 1 more figure
