What's Mine becomes Yours: Defining, Annotating and Detecting Context-Dependent Paraphrases in News Interview Dialogs
Anna Wegmann, Tijs van den Broek, Dong Nguyen
TL;DR
This work defines and operationalizes context-dependent paraphrases in dialog and introduces ContextDeP, a dataset of 600 guest-host utterance pairs from NPR and CNN news interviews with 5,581 annotations. It provides an annotation framework for identifying paraphrase spans across turns, examines label variation, and reports promising results for paraphrase detection in dialog using both token classification with DeBERTa and in-context learning with multiple LLMs. The study highlights how difficult it is to establish ground truth for context-dependent paraphrases, and finds that GPT-4 performs best at classification while DeBERTa-based token classifiers perform best at span highlighting. The released data, code, and models support future research on evaluating and improving paraphrase detection in dialog and on its use in dialog systems and social science analyses.
Abstract
Best practices for high-conflict conversations like counseling or customer support almost always include recommendations to paraphrase the previous speaker. Although paraphrase classification has received widespread attention in NLP, paraphrases are usually considered independent from context, and common models and datasets are not applicable to dialog settings. In this work, we investigate paraphrases in dialog (e.g., Speaker 1: "That book is mine." becomes Speaker 2: "That book is yours."). We provide an operationalization of context-dependent paraphrases, and develop a training procedure for crowd workers to classify paraphrases in dialog. We introduce a dataset with utterance pairs from NPR and CNN news interviews annotated for context-dependent paraphrases. To enable analyses on label variation, the dataset contains 5,581 annotations on 600 utterance pairs. We present promising results with in-context learning and with token classification models for automatic paraphrase detection in dialog.
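To make the idea of multiple annotations per utterance pair concrete, the following is a minimal sketch of how several annotators' judgments on one pair could be summarized without discarding disagreement. It is not the authors' aggregation method: the function names, the 0.5 majority threshold, the whitespace tokenization, and the example votes and spans are all assumptions made for illustration.

```python
from collections import Counter


def aggregate_pair(votes):
    """Summarize binary paraphrase judgments for one utterance pair.

    votes: list of bools, one per annotator (hypothetical input format).
    Returns the share of "paraphrase" votes and a majority label.
    The 0.5 threshold is an illustrative assumption.
    """
    if not votes:
        raise ValueError("need at least one annotation")
    share = sum(votes) / len(votes)
    return share, share >= 0.5


def token_highlight_share(tokens, spans):
    """Per-token share of annotators who highlighted that token.

    spans: one set of token indices per annotator who marked a paraphrase.
    """
    counts = Counter(i for span in spans for i in span)
    n = len(spans)
    return [counts[i] / n if n else 0.0 for i in range(len(tokens))]


# Made-up annotations for the guest utterance from the abstract's example.
guest = "That book is mine .".split()
votes = [True, True, False, True]
spans = [{0, 1, 2, 3}, {1, 2, 3}, {2, 3}]

share, label = aggregate_pair(votes)
token_shares = token_highlight_share(guest, spans)
```

Keeping the vote share and the per-token shares alongside any majority label preserves the label variation that the dataset is designed to expose, rather than collapsing each pair to a single ground-truth value.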
