ParaFusion: A Large-Scale LLM-Driven English Paraphrase Dataset Infused with High-Quality Lexical and Syntactic Diversity

Lasal Jayawardena; Prasan Yapa

TL;DR

ParaFusion addresses the limited diversity and quality of existing English paraphrase datasets by using LLMs to generate diverse, semantically faithful paraphrases from multiple sources. It combines data from MRPC, Quora, and PAWSWiki (with careful filtering) and uses iterative prompts with gpt-3.5-turbo to produce about 2 million unique paraphrase pairs, achieving significant gains in lexical and syntactic diversity while preserving meaning. The paper introduces a comprehensive evaluation framework (semantic, syntactic, and lexical metrics, qualitative analyses, human judgments, and GPT-4-based assessments) that shows at least a 25% improvement in diversity metrics, and it aims to set a gold standard for future paraphrase evaluation. The work supports NLP data augmentation and model robustness, while acknowledging its English-only scope, noise risks, and potential error propagation from gpt-3.5-turbo, and it outlines ethical considerations such as hate speech mitigation.

Abstract

Paraphrase generation is a pivotal task in natural language processing (NLP). Existing datasets in the domain lack syntactic and lexical diversity, resulting in paraphrases that closely resemble the source sentences. Moreover, these datasets often contain hate speech and noise, and may unintentionally include non-English sentences. This research introduces ParaFusion, a large-scale, high-quality English paraphrase dataset developed using Large Language Models (LLMs) to address these challenges. ParaFusion augments existing datasets with high-quality data, significantly enhancing both lexical and syntactic diversity while maintaining close semantic similarity. It also mitigates the presence of hate speech and reduces noise, ensuring a cleaner and more focused English dataset. Results show that ParaFusion offers at least a 25% improvement in both syntactic and lexical diversity, measured across several metrics for each data source. The paper also aims to set a gold standard for paraphrase evaluation, as it contains one of the most comprehensive evaluation strategies to date. The results underscore the potential of ParaFusion as a valuable resource for improving NLP applications.
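
To make the "at least a 25% improvement" claim concrete, the following is a minimal sketch of how a lexical diversity score and a relative improvement could be computed. The abstract does not name the metrics the paper uses, so the token-level Jaccard distance here is an assumed stand-in, and the function names `lexical_diversity` and `relative_improvement` are hypothetical.

```python
import re


def tokens(sentence: str) -> set[str]:
    """Lowercase the sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def lexical_diversity(source: str, paraphrase: str) -> float:
    """Jaccard distance between token sets (assumed stand-in metric).

    0.0 means the two sentences share every token; 1.0 means they
    share none. This is NOT necessarily a metric used in the paper.
    """
    a, b = tokens(source), tokens(paraphrase)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def relative_improvement(baseline: float, new: float) -> float:
    """Relative change of `new` over `baseline`, e.g. 0.25 for +25%."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline
```

Under this sketch, a dataset whose mean diversity score rises from 0.4 to 0.5 would show a relative improvement of 25%. Averaging such a score over every pair in a source dataset and in its augmented counterpart would give the kind of per-source comparison the abstract describes.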

Paper Structure (16 sections, 6 figures, 8 tables)

This paper contains 16 sections, 6 figures, 8 tables.

Introduction
Related Work
ParaFusion
Data Sources
Base Dataset Creation
Additional Processing
Evaluation
Quantitative Analysis
Semantic Similarity
Syntactic Diversity
Lexical Diversity
Qualitative Evaluation
Human Evaluation
LLM Evaluation
Conclusion
...and 1 more section

Figures (6)

Figure 1: High-level diagram outlining the dataset creation process.
Figure 2: This figure illustrates a sample prompt fed to the gpt-3.5-turbo model for generating diverse paraphrases.
Figure 3: This figure illustrates an instance where the paraphrase in a source dataset has only word substitutions.
Figure 4: This figure illustrates an instance where the paraphrase in a previous dataset has a different meaning.
Figure 5: This figure illustrates the prompt fed to the gpt-4 model for evaluation.
...and 1 more figure
