Evaluating Task-Oriented Dialogue Consistency through Constraint Satisfaction

Tiziano Labruna, Bernardo Magnini

TL;DR

The paper addresses task-oriented dialogue (TOD) consistency by modeling it as a Constraint Satisfaction Problem (CSP): dialogue segments that reference the domain are the variables, and linguistic, conversational, and domain constraints define the admissible space. A CSP solver is used to detect inconsistencies and to evaluate LLM-generated re-lexicalizations (produced with GPT-3.5-Turbo) against the CSP-computed feasible solutions. The overall re-lexicalization accuracy is about 0.15, and constraints derived from domain knowledge are the hardest to satisfy. An ablation study shows that removing domain constraints substantially improves performance, underscoring the gap between current LLM capabilities and domain-grounded generation. The proposed CSP-based evaluation framework offers a reusable, principled way to diagnose TOD consistency in dialogue systems that rely on external knowledge bases.

Abstract

Task-oriented dialogues must maintain consistency both within the dialogue itself, ensuring logical coherence across turns, and with the conversational domain, accurately reflecting external knowledge. We propose to conceptualize dialogue consistency as a Constraint Satisfaction Problem (CSP), wherein variables represent segments of the dialogue referencing the conversational domain, and constraints among variables reflect dialogue properties, including linguistic, conversational, and domain-based aspects. To demonstrate the feasibility of the approach, we use a CSP solver to detect inconsistencies in dialogues re-lexicalized by an LLM. Our findings indicate that: (i) CSP is effective at detecting dialogue inconsistencies; and (ii) consistent dialogue re-lexicalization is challenging for state-of-the-art LLMs, which achieve an accuracy of only 0.15 when compared to a CSP solver. Furthermore, an ablation study reveals that constraints derived from domain knowledge are the hardest to respect. We argue that CSP captures core properties of dialogue consistency that have been poorly considered by approaches based on component pipelines.
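
The CSP framing described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy knowledge base, the slot names, the two constraint functions, and the brute-force enumeration are hypothetical and are not the paper's actual constraints or solver. The accuracy function is a plain fraction of candidates that fall in the feasible set, which may differ from the metric used in the paper.

```python
from itertools import product

# Hypothetical toy knowledge base (illustrative only, not from the paper).
KB = [
    {"name": "Roma", "food": "italian", "area": "centre"},
    {"name": "Napoli", "food": "italian", "area": "north"},
    {"name": "Sakura", "food": "japanese", "area": "centre"},
]

# Variables: delexicalized slots in the dialogue, each with a candidate domain.
DOMAINS = {
    "user_food": sorted({row["food"] for row in KB}),
    "user_area": sorted({row["area"] for row in KB}),
    "sys_name": sorted({row["name"] for row in KB}),
    "sys_food": sorted({row["food"] for row in KB}),
}


def dialogue_constraint(a):
    # Internal consistency: the system repeats the food type the user asked for.
    return a["sys_food"] == a["user_food"]


def domain_constraint(a):
    # External consistency: the proposed restaurant exists in the KB
    # with the food type and area that the user requested.
    return any(
        row["name"] == a["sys_name"]
        and row["food"] == a["user_food"]
        and row["area"] == a["user_area"]
        for row in KB
    )


CONSTRAINTS = [dialogue_constraint, domain_constraint]


def solve(domains, constraints):
    """Enumerate every assignment and keep those satisfying all constraints."""
    names = list(domains)
    solutions = []
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(check(assignment) for check in constraints):
            solutions.append(assignment)
    return solutions


def is_consistent(assignment, domains=DOMAINS, constraints=CONSTRAINTS):
    """True if a candidate re-lexicalization is a feasible CSP solution."""
    in_domain = all(assignment[n] in domains[n] for n in domains)
    return in_domain and all(check(assignment) for check in constraints)


def accuracy(candidates):
    """Fraction of candidate re-lexicalizations that are consistent."""
    return sum(is_consistent(c) for c in candidates) / len(candidates)


if __name__ == "__main__":
    good = {"user_food": "italian", "user_area": "centre",
            "sys_name": "Roma", "sys_food": "italian"}
    # Sakura is a japanese restaurant, so this violates the domain constraint.
    bad = dict(good, sys_name="Sakura")
    print(len(solve(DOMAINS, CONSTRAINTS)), accuracy([good, bad]))
```

In this sketch a candidate re-lexicalization, such as one produced by an LLM, is judged by checking whether its slot values form one of the feasible assignments. Dropping `domain_constraint` from the list mimics, in spirit, an ablation that removes domain knowledge.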

Paper Structure

This paper contains 29 sections, 10 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: An inconsistent task-oriented dialogue with a Knowledge Base. Red values indicate internal inconsistencies, purple values indicate external inconsistencies.
  • Figure 2: Overview of the CSP-based methodology applied to TOD consistency.