CoPrUS: Consistency Preserving Utterance Synthesis towards more realistic benchmark dialogues

Sebastian Steindl, Ulrich Schäfer, Bernd Ludwig

TL;DR

This work targets realism gaps in task-oriented dialogue benchmarks by injecting post-hoc miscommunications into Wizard-of-Oz dialogues. It introduces CoPrUS, a two-step prompting pipeline that uses a large language model to generate miscommunication utterances (misunderstandings, MU; non-understandings, NU; vaguely related questions, VQ) and the corresponding repairing utterances, with Prometheus 2-based automatic quality assurance. The authors apply CoPrUS to MultiWOZ 2.1, creating the CoPrUS-MultiWOZ dataset (nearly 1900 modified dialogues), and report that the augmented data yields comparable or modestly improved results across NLG, NLU, and DST tasks. The dataset and methodology enable more realistic benchmarks and pave the way for future research on error handling and recovery in dialogue systems. The authors acknowledge limitations in generation quality and the need for a broader taxonomy and real-world validation.

Abstract

Large-scale Wizard-of-Oz dialogue datasets have enabled the training of deep learning-based dialogue systems. While they are successful as benchmark datasets, they lack certain types of utterances that would make them more realistic. In this work, we investigate the creation of synthetic communication errors in an automatic pipeline. Based on linguistic theory, we propose and follow a simple error taxonomy. We focus on three types of miscommunications that could happen in real-world dialogues but are underrepresented in the benchmark dataset: misunderstandings, non-understandings, and vaguely related questions. Our two-step approach uses a state-of-the-art Large Language Model (LLM) to create first the error and then the repairing utterance. We perform Language Model-based evaluation to ensure the quality of the generated utterances. We apply the method to the MultiWOZ dataset and evaluate it qualitatively, empirically, and with human judges. Our results indicate that current LLMs can aid in adding post-hoc miscommunications to benchmark datasets as a form of data augmentation. We publish the resulting dataset, in which nearly 1900 dialogues have been modified, as CoPrUS-MultiWOZ to facilitate future work on dialogue systems.
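The two-step approach described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Turn`, `augment`, `generate`, and `judge`, the prompt wording, the `error_speaker` parameter, and the acceptance threshold `min_score` are all assumptions made for illustration. The LLM and the evaluator are passed in as plain callables, so no particular model API is implied.

```python
from dataclasses import dataclass
from typing import Callable, List

# The three miscommunication types named in the abstract.
ERROR_TYPES = ("misunderstanding", "non-understanding", "vaguely related question")


@dataclass
class Turn:
    speaker: str             # "user" or "system"
    text: str
    synthetic: bool = False  # True for utterances added by the pipeline


# Hypothetical interfaces: any LLM wrapper and any LM-based judge could be plugged in.
Generate = Callable[[str], str]                    # prompt -> generated utterance
Judge = Callable[[List[Turn], Turn, Turn], float]  # (history, error, repair) -> score


def render(turns: List[Turn]) -> str:
    """Format turns as plain text for use inside a prompt."""
    return "\n".join(f"{t.speaker}: {t.text}" for t in turns)


def augment(dialogue: List[Turn], position: int, error_type: str,
            error_speaker: str, generate: Generate, judge: Judge,
            min_score: float = 4.0) -> List[Turn]:
    """Insert a miscommunication/repair pair before dialogue[position].

    The original turns are never edited, so later turns stay consistent.
    If the judge scores the pair below min_score (an assumed threshold),
    an unchanged copy of the dialogue is returned.
    """
    if error_type not in ERROR_TYPES:
        raise ValueError(f"unknown error type: {error_type}")
    history = dialogue[:position]
    repair_speaker = "system" if error_speaker == "user" else "user"

    # Step 1: generate the miscommunication utterance.
    error_prompt = (f"Dialogue so far:\n{render(history)}\n"
                    f"Write a {error_type} utterance by the {error_speaker}.")
    error = Turn(error_speaker, generate(error_prompt), synthetic=True)

    # Step 2: generate the repairing utterance, conditioned on the error.
    repair_prompt = (f"Dialogue so far:\n{render(history + [error])}\n"
                     f"Write the {repair_speaker}'s repairing utterance.")
    repair = Turn(repair_speaker, generate(repair_prompt), synthetic=True)

    # Quality assurance: keep the pair only if the judge accepts it.
    if judge(history, error, repair) < min_score:
        return list(dialogue)
    return history + [error, repair] + dialogue[position:]
```

Passing the model and the judge as callables keeps the sketch independent of any specific LLM or evaluator; rejected generations simply leave the dialogue untouched.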

Paper Structure

This paper contains 20 sections, 2 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Example MultiWOZ dialogue after application of CoPrUS. The miscommunication utterance, in this case a vaguely related question, is highlighted in red, and the system's repair attempt in blue.
  • Figure 2: Two examples of miscommunication and repairing utterances for each type.
  • Figure 3: Prompting procedure with shortened prompts. Full prompts are shown in the appendix.
  • Figure 4: Overview of the CoPrUS workflow.
  • Figure 5: The full prompts used for the Llama model.
  • ...and 1 more figure