Towards Human Understanding of Paraphrase Types in Large Language Models

Dominik Meier; Jan Philip Wahle; Terry Ruas; Bela Gipp

TL;DR

The paper addresses the limited interpretability of paraphrase evaluation by using Atomic Paraphrase Types (APTs) and introducing two human-annotated datasets (APTYBase and APTYRanked) that capture fine-grained linguistic changes. It generates paraphrase candidates with ChatGPT across five prompting techniques, collects detailed human annotations and preference rankings, and evaluates LLama 7B variants trained with DPO on the ranked data. The findings show that current LLMs can generate simple APTs, such as additions and deletions, while complex structural changes, such as subordination changes, remain difficult; prompting strategy and human preferences also reveal differences between generation success and perceived quality. The datasets and findings can inform RLHF- and DPO-based training and support benchmarking models on explicit linguistic capabilities, with implications for paraphrase generation and evaluation.

Abstract

Paraphrases represent a human's intuitive ability to understand expressions presented in different ways. Current paraphrase evaluations of language models primarily use binary approaches, offering limited interpretability of specific text changes. Atomic paraphrase types (APT) decompose paraphrases into different linguistic changes and offer a granular view of the flexibility in linguistic expression (e.g., a shift in syntax or vocabulary used). In this study, we assess the human preferences towards ChatGPT in generating English paraphrases with ten APTs and five prompting techniques. We introduce APTY (Atomic Paraphrase TYpes), a dataset of 800 sentence-level and word-level annotations by 15 annotators. The dataset also provides a human preference ranking of paraphrases with different types that can be used to fine-tune models with RLHF and DPO methods. Our results reveal that ChatGPT and a DPO-trained LLama 7B model can generate simple APTs, such as additions and deletions, but struggle with complex structures (e.g., subordination changes). This study contributes to understanding which aspects of paraphrasing language models have already mastered and which remain elusive. In addition, we show how our curated datasets can be used to develop language models with specific linguistic capabilities.
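
As an illustration of how a human preference ranking could feed DPO-style training, the following minimal sketch expands one ranked list of paraphrases into pairwise preferences. It is not the authors' pipeline: the function name, the `prompt`/`chosen`/`rejected` field names, the example sentences, and the assumption that the list is ordered best first are all hypothetical choices made for this example.

```python
from itertools import combinations


def ranking_to_preference_pairs(source, ranked_paraphrases):
    """Turn one ranked list into pairwise preference records.

    Assumption: `ranked_paraphrases` is ordered from best to worst.
    Every earlier item is treated as preferred over every later item.
    """
    pairs = []
    # combinations() preserves input order, so `better` always precedes
    # `worse` in the ranking.
    for better, worse in combinations(ranked_paraphrases, 2):
        pairs.append({"prompt": source, "chosen": better, "rejected": worse})
    return pairs


# Hypothetical example data, not taken from the dataset.
example_pairs = ranking_to_preference_pairs(
    "The cat sat on the mat.",
    [
        "A cat was sitting on the mat.",
        "On the mat sat the cat.",
        "The cat sat.",
    ],
)
```

A ranking of n paraphrases yields n(n-1)/2 pairs under this scheme. If the ranking is stored from worst to best, the list would need to be reversed before calling the function.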

Paper Structure (21 sections, 14 figures, 5 tables)

Introduction
Related Work
Methodology
Paraphrase Type Generation
Annotation
Experiments
Final Considerations
Additional Annotation Information
Full List of Considered APT
Dataset
Additional Experiments and Details
Additional Questions
Result Details
Prompts
APT Definitions
...and 6 more sections

Figures (14)

Figure 1: The generation and annotation process. During paraphrase generation (a), we select samples from the ETPC dataset (Kovatchev et al., 2018) and prompt ChatGPT-3.5-turbo-0613 (OpenAI) using zero-shot, one-shot, few-shot, and chain-of-thought prompting, as well as a fine-tuned model, to generate new examples for the selected paraphrase types. For each combination of technique and paraphrase type, we sample ten sentences. With five prompting techniques and ten selected APTs, we produce 500 sentence pairs. In (b), the paraphrase candidates are annotated by 15 humans, who answer questions and highlight the word spans of the change. In (c), we select the generations in which the APT has been applied correctly, and in (d), the selected generations are ranked from worst to best.
Figure 2: The success rate (y-axis) in generating each APT (x-axis) across all tested prompting techniques of ChatGPT.
Figure 3: The success rate in generating a specific APT for different prompting techniques of ChatGPT.
Figure 4: Confusion matrices for additional and erroneous changes. Columns give the intended APT; rows give the additional or wrongly applied APT.
Figure 5: The success rate in generating a specific APT for different LLama 7B models.
...and 9 more figures
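
The arithmetic behind Figure 1 (five techniques, ten APTs, ten sentences per combination, 500 sentence pairs) and the per-APT success rate plotted in Figures 2, 3, and 5 can be sketched as follows. This is a rough illustration under stated assumptions: the `APT_1` ... `APT_10` labels are placeholders rather than the paper's type names, and treating the success rate as the fraction of generations judged to apply the intended APT correctly is an assumed reading of the captions.

```python
from collections import defaultdict
from itertools import product

TECHNIQUES = ["zero-shot", "one-shot", "few-shot", "chain-of-thought", "fine-tuned"]
APTS = [f"APT_{i}" for i in range(1, 11)]  # placeholder labels, not the real type names
SENTENCES_PER_COMBINATION = 10


def build_generation_plan():
    """List every (technique, APT, sentence index) generation slot."""
    return [
        (technique, apt, idx)
        for technique, apt in product(TECHNIQUES, APTS)
        for idx in range(SENTENCES_PER_COMBINATION)
    ]


def success_rate_by_apt(records):
    """Compute the share of successful generations per APT.

    `records` is an iterable of (apt, success) tuples, where `success`
    is True if the intended APT was judged to be applied correctly
    (assumed definition).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for apt, success in records:
        totals[apt] += 1
        hits[apt] += int(success)
    return {apt: hits[apt] / totals[apt] for apt in totals}


plan = build_generation_plan()

# Hypothetical judgments for illustration only.
rates = success_rate_by_apt([("APT_1", True), ("APT_1", False), ("APT_2", True)])
```

Grouping the same records by technique instead of APT would give the per-technique breakdown in the same way.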
