Towards Automatic Evaluation of Task-Oriented Dialogue Flows

Mehrnoosh Mirtaheri, Nikhil Varghese, Chandra Khatri, Amol Kelkar

TL;DR

This work introduces FuDGE (Fuzzy Dialogue-Graph Edit Distance), a novel metric that evaluates dialogue flows by assessing their structural complexity and how well they represent the conversation data. Experiments on manually configured and automatically generated flows demonstrate the effectiveness of FuDGE and its evaluation framework.

Abstract

Task-oriented dialogue systems rely on predefined conversation schemes (dialogue flows) often represented as directed acyclic graphs. These flows can be manually designed or automatically generated from previously recorded conversations. Due to variations in domain expertise or reliance on different sets of prior conversations, these dialogue flows can manifest in significantly different graph structures. Despite their importance, there is no standard method for evaluating the quality of dialogue flows. We introduce FuDGE (Fuzzy Dialogue-Graph Edit Distance), a novel metric that evaluates dialogue flows by assessing their structural complexity and representational coverage of the conversation data. FuDGE measures how well individual conversations align with a flow and, consequently, how well a set of conversations is represented by the flow overall. Through extensive experiments on manually configured flows and flows generated by automated techniques, we demonstrate the effectiveness of FuDGE and its evaluation framework. By standardizing and optimizing dialogue flows, FuDGE enables conversational designers and automated techniques to achieve higher levels of efficiency and automation.

Paper Structure

This paper contains 25 sections, 12 equations, 3 figures, 4 tables, and 2 algorithms.

Figures (3)

  • Figure 1: An illustration of a dialogue flow (left) that is used to configure a task-oriented dialogue agent for a fictitious company named "Everyclothing." The chat widget (right) illustrates a customer asking about canceling their order.
  • Figure 2: Efficient memoization used for FuDGE.
  • Figure 3: Parameter tuning with FF1 for ALG2 on the Make Payment task. The left column shows the scores for an unsupervised discovered flow, and the right column shows the scores for the supervised flow. The optimal $k$ is smaller for the supervised flow, indicating better compression.