"Stupid robot, I want to speak to a human!" User Frustration Detection in Task-Oriented Dialog Systems

Mireia Hernandez Caralt, Ivan Sekulić, Filip Carević, Nghia Khau, Diana Nicoleta Popa, Bruna Guedes, Victor Guimarães, Zeyu Yang, Andre Manso, Meghana Reddy, Paolo Rosso, Roland Mathis

TL;DR

The paper tackles the practical problem of detecting user frustration in deployed task-oriented dialog (TOD) systems, addressing a gap between academic emotion/sentiment studies and real-world data. It compares a deployed rule-based method, open-source sentiment/emotion detectors, a dialog-breakdown detector, and an in-context learning (ICL) approach using large language models, all evaluated on a real-world TOD corpus. The results show that open-source methods underperform on production data, while LLM-based ICL yields the strongest performance, achieving a 16% relative improvement in $F_1$ on an internal benchmark. The work highlights industry implications and differences between academic and real-world data, and suggests avenues such as multi-modal signals and user profiling to further improve frustration detection in TOD systems.

Abstract

Detecting user frustration in modern-day task-oriented dialog (TOD) systems is imperative for maintaining overall user satisfaction, engagement, and retention. However, most recent research is focused on sentiment and emotion detection in academic settings, thus failing to fully capture the implications of real-world user data. To mitigate this gap, in this work, we focus on user frustration in a deployed TOD system, assessing the feasibility of out-of-the-box solutions for user frustration detection. Specifically, we compare the performance of our deployed keyword-based approach, open-source approaches to sentiment analysis, dialog breakdown detection methods, and emerging in-context learning LLM-based detection. Our analysis highlights the limitations of open-source methods for real-world frustration detection, while demonstrating the superior performance of the LLM-based approach, achieving a 16% relative improvement in F1 score on an internal benchmark. Finally, we analyze advantages and limitations of our methods and provide insight into the user frustration detection task for industry practitioners.

Paper Structure

This paper contains 19 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Example of user frustration in a deployed TOD system. The user can only come after 6PM due to work, but the system misses this and suggests the next available slot. Traditional sentiment models often fail to detect such nuances, as there is no explicit mention of negative sentiment.
  • Figure 2: a) Negative sentiment prediction influenced by the apologetic behavior of the system. b) Frustration caused by the system's failure to transfer the user to a live agent. The yellow exclamation sign indicates that the example was correctly classified by the FC-based sentiment approach only. c) Non-frustrated repetition of rejections during time slot negotiation. The yellow exclamation sign indicates that the example was classified correctly by the LU-based and incorrectly by the FC-based sentiment/emotion detection methods.
  • Figure 3: In-context learning prompt for user frustration detection in task-oriented dialog systems. The context comprises the description of the task ($T$), the domain of the conversation ($D$), and the conversation history ($H$). Our prompt also includes output instructions to generate binary user frustration labels.
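
The prompt layout described for Figure 3 can be sketched in code. The following is a minimal illustration, not the authors' actual prompt: the wording of the instructions, the `yes`/`no` label format, and the names `Turn`, `build_prompt`, and `parse_label` are all assumptions made for this example. It only shows how a task description ($T$), a domain ($D$), and a conversation history ($H$) might be combined with output instructions, and how a binary label might be read back from a model's reply.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One dialog turn (hypothetical structure for this sketch)."""
    speaker: str  # e.g. "User" or "System"
    text: str


def build_prompt(task: str, domain: str, history: List[Turn]) -> str:
    """Combine T (task), D (domain), H (history) and output instructions.

    The exact wording is an assumption; the paper's prompt may differ.
    """
    dialog = "\n".join(f"{turn.speaker}: {turn.text}" for turn in history)
    return (
        f"Task: {task}\n"
        f"Domain: {domain}\n"
        f"Conversation history:\n{dialog}\n"
        "Answer with exactly one word: 'yes' if the user is frustrated, "
        "'no' otherwise."
    )


def parse_label(response: str) -> bool:
    """Map the model's free-text reply to a binary frustration label."""
    answer = response.strip().lower()
    if answer.startswith("yes"):
        return True
    if answer.startswith("no"):
        return False
    raise ValueError(f"Unexpected model output: {response!r}")


# Hypothetical dialog loosely modeled on the scenario in Figure 1.
history = [
    Turn("User", "I can only come after 6 PM because of work."),
    Turn("System", "The next available slot is at 2 PM."),
    Turn("User", "I just told you I can only come after 6 PM."),
]
prompt = build_prompt(
    task="Decide whether the user is frustrated with the system.",
    domain="appointment scheduling",
    history=history,
)
```

The resulting `prompt` string would be sent to whichever LLM is in use, and `parse_label` applied to its reply. Raising on unexpected output, rather than defaulting to a label, is a design choice for this sketch so that malformed replies are not silently counted as either class.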