Inconsistent dialogue responses and how to recover from them

Mian Zhang, Lifeng Jin, Linfeng Song, Haitao Mi, Dong Yu

TL;DR

This work introduces CIDER, a large human-authored dataset of 27,180 dialogues designed to study inconsistency in dialogue, capturing the full life cycle from introduction to resolution through inconsistent continuations, explanations, and clarifying responses. It formalizes two consistency-checking tasks (Pair-Check and Diag-Check) and two consistency-resolution tasks (Pair-Resolve and Diag-Resolve), and evaluates supervised checkers and resolvers against LLMs on these tasks. Experiments show that CIDER provides robust supervision for detecting inconsistencies and that while large language models excel at resolving inconsistencies, they lag behind specialized models in detection, highlighting a gap between recognizing and remedying inconsistencies. The paper also analyzes data ablations and the impact of extra data, demonstrates transferability across domains, and offers insights into prompt-based LLM performance, establishing a foundation for more reliable, self-consistent dialogue systems with practical implications for building chatbots with verifiable consistency.

Abstract

One critical issue for chat systems is staying consistent about their own preferences, opinions, beliefs, and facts, which has been shown to be a difficult problem. In this work, we study methods to assess and bolster the utterance consistency of chat systems. We first develop a dataset for studying inconsistencies, in which annotators author inconsistent dialogue responses, explanations of the inconsistencies, and recovery utterances. This covers the life span of an inconsistency, namely introduction, understanding, and resolution. Building on this, we introduce a set of tasks centered on dialogue consistency, focused specifically on its detection and resolution. Our experimental findings indicate that our dataset significantly helps progress in identifying and resolving conversational inconsistencies, and that currently popular large language models such as ChatGPT, while good at resolving inconsistencies, still struggle with detection.

Paper Structure

This paper contains 22 sections, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: An instance from the CIDER dataset. {A, B}$x$ denotes the $x$-th utterance of one of the two speakers (A or B). For each dialogue, annotators write an inconsistent utterance (A2 in this case), an explanation of the inconsistency (the dotted box), and a clarification response (B2 in this case).
  • Figure 2: Prompts of checking tasks.
  • Figure 3: Prompts of resolving tasks.
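
The instance layout described in the Figure 1 caption can be sketched as a small data structure. This is only an illustrative sketch: the class name, field names, and example utterances below are hypothetical and are not taken from the CIDER release; only the four components (dialogue context, inconsistent utterance, explanation, clarification response) and the {A, B}x labeling come from the caption.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Turn = Tuple[str, str]  # (speaker, utterance), speaker is "A" or "B"


@dataclass
class CiderInstance:
    # Hypothetical field names; only the four components are from the caption.
    context: List[Turn]            # dialogue history before the inconsistency
    inconsistent_utterance: Turn   # annotator-written inconsistent continuation
    explanation: str               # annotator-written explanation
    clarification: Turn            # annotator-written clarification response

    def labeled_turns(self) -> List[str]:
        """Label every turn as A1, B1, A2, ... following the {A, B}x notation."""
        counts: Dict[str, int] = {}
        labels: List[str] = []
        turns = self.context + [self.inconsistent_utterance, self.clarification]
        for speaker, _ in turns:
            counts[speaker] = counts.get(speaker, 0) + 1
            labels.append(f"{speaker}{counts[speaker]}")
        return labels


# Invented example data, for illustration only.
example = CiderInstance(
    context=[("A", "I have never owned a pet."),
             ("B", "Would you like one someday?")],
    inconsistent_utterance=("A", "My dog would get jealous."),
    explanation="A said they never owned a pet, but then mentions their dog.",
    clarification=("B", "Wait, I thought you never owned a pet?"),
)
print(example.labeled_turns())
```

With this layout, the inconsistent utterance and the clarification response in the invented example fall on the labels A2 and B2, matching the positions named in the caption.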