Actionable Conversational Quality Indicators for Improving Task-Oriented Dialog Systems

Michael Higgins, Dominic Widdows, Chris Brew, Gwen Christian, Andrew Maurer, Matthew Dunn, Sujit Mathi, Akshay Hazare, George Bonev, Beth Ann Hockey, Kristen Howell, Joe Bradley

TL;DR

The paper introduces Actionable Conversational Quality Indicators (ACQIs) that, together with a running Interaction Quality (IQ) score, identify and prescribe fixes for failure points in task-oriented dialogs. Using LEGOv2 and LivePerson datasets, it develops a taxonomy of ACQIs linked to concrete remediation actions and demonstrates that a text-based predictive model can label ACQIs with a weighted F1 of 0.79 and IQ with around 60% accuracy. The combined ACQI+IQ framework reveals where improvements yield the most impact on user experience and suggests substantial reductions in the number of tuning actions required for bot-builders. Overall, the work presents a data-driven, actionable approach for diagnosing and prioritizing improvements in commercial dialog systems, with strong implications for tool design and future research on CX optimization.

Abstract

Automatic dialog systems have become a mainstream part of online customer service. Many such systems are built, maintained, and improved by customer service specialists, rather than dialog systems engineers and computer programmers. As conversations between people and machines become commonplace, it is critical to understand what is working, what is not, and what actions can be taken to reduce the frequency of inappropriate system responses. These analyses and recommendations need to be presented in terms that directly reflect the user experience rather than the internal dialog processing. This paper introduces and explains the use of Actionable Conversational Quality Indicators (ACQIs), which are used both to recognize parts of dialogs that can be improved, and to recommend how to improve them. This combines benefits of previous approaches, some of which have focused on producing dialog quality scoring while others have sought to categorize the types of errors the dialog system is making. We demonstrate the effectiveness of using ACQIs on LivePerson internal dialog systems used in commercial customer service applications, and on the publicly available CMU LEGOv2 conversational dataset (Raux et al. 2005). We report on the annotation and analysis of conversational datasets showing which ACQIs are important to fix in various situations. The annotated datasets are then used to build a predictive model which uses a turn-based vector embedding of the message texts and achieves a 79% weighted average F1-measure at the task of finding the correct ACQI for a given conversation. We predict that if such a model worked perfectly, the range of potential improvement actions a bot-builder must consider at each turn could be reduced by an average of 81%.
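
For readers unfamiliar with the metric, the weighted average F1-measure quoted above is conventionally the per-class F1 score averaged with weights proportional to each class's frequency in the reference labels. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `weighted_f1` and the placeholder labels `"A"` and `"B"` are illustrative assumptions and do not correspond to actual ACQI names.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Classes are taken from the reference labels (y_true); each class's F1
    is weighted by its share of the reference labels.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        # True positives: predicted this label and it was correct.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        # False positives: predicted this label for a turn of another class.
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        # False negatives: turns of this class predicted as something else.
        fn = n - tp
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is positive because n > 0.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score


# Placeholder labels only; these are not real ACQI categories.
reference = ["A", "A", "B", "B"]
predicted = ["A", "B", "B", "B"]
result = weighted_f1(reference, predicted)
```

Weighting by class frequency means that common labels dominate the score, so a high weighted F1 does not by itself show that rare labels are predicted well.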

Paper Structure

This paper contains 25 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: A bot-building interface where the bot-builder is about to add a multiple choice question
  • Figure 2: Change in IQ score distribution between IQ annotations in LEGOv2 and the work reported in this paper.
  • Figure 3: Distribution of negative/neutral/positive score changes grouped by CMU and LivePerson dialog systems.
  • Figure 4: ACQIs from the ACQI taxonomy table, along with the proportions of each that were aligned with positive, negative, and neutral changes in IQ. Note that this graphic excludes any turn whose preceding IQ score was 1 or 5.
  • Figure 5: Dependence of score change on number of confirmations.