"Stupid robot, I want to speak to a human!" User Frustration Detection in Task-Oriented Dialog Systems
Mireia Hernandez Caralt, Ivan Sekulić, Filip Carević, Nghia Khau, Diana Nicoleta Popa, Bruna Guedes, Victor Guimarães, Zeyu Yang, Andre Manso, Meghana Reddy, Paolo Rosso, Roland Mathis
TL;DR
The paper tackles the practical problem of detecting user frustration in deployed task-oriented dialog (TOD) systems, addressing the gap between academic emotion and sentiment studies and real-world data. It compares a deployed keyword-based method, open-source sentiment and emotion detectors, a dialog breakdown detector, and an in-context learning (ICL) approach using large language models, all evaluated on a real-world TOD corpus. The results show that open-source methods underperform on production data, while LLM-based ICL performs best, achieving a 16% relative improvement in $F_1$ on an internal benchmark. The work discusses implications for industry practitioners and differences between academic and real-world data, and suggests avenues such as multi-modal signals and user profiling to further improve frustration detection in TOD systems.
Abstract
Detecting user frustration in modern-day task-oriented dialog (TOD) systems is imperative for maintaining overall user satisfaction, engagement, and retention. However, most recent research focuses on sentiment and emotion detection in academic settings, and thus fails to fully capture the implications of real-world user data. To mitigate this gap, we focus on user frustration in a deployed TOD system and assess the feasibility of out-of-the-box solutions for user frustration detection. Specifically, we compare the performance of our deployed keyword-based approach, open-source approaches to sentiment analysis, dialog breakdown detection methods, and emerging in-context learning LLM-based detection. Our analysis highlights the limitations of open-source methods for real-world frustration detection, while demonstrating the superior performance of the LLM-based approach, which achieves a 16% relative improvement in F1 score on an internal benchmark. Finally, we analyze the advantages and limitations of our methods and provide insights into the user frustration detection task for industry practitioners.
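
To make the comparison concrete, the following is a minimal sketch of what a keyword-based baseline and the reported metric could look like. It is not the deployed system: the pattern list and the function names (`is_frustrated`, `f1_score`, `relative_improvement`) are hypothetical and chosen only for illustration, since the abstract does not give the actual keywords.

```python
import re

# Hypothetical trigger phrases for illustration only; the deployed
# system's actual keyword list is not given in the source.
FRUSTRATION_PATTERNS = [
    r"\bstupid\b",
    r"\buseless\b",
    r"\b(speak|talk) to a (human|person|agent)\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in FRUSTRATION_PATTERNS]


def is_frustrated(utterance: str) -> bool:
    """Flag an utterance if any (assumed) frustration pattern matches."""
    return any(p.search(utterance) for p in _COMPILED)


def f1_score(predictions, labels) -> float:
    """Binary F1 for the positive (frustrated) class."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)
    fp = sum(1 for p, y in pairs if p and not y)
    fn = sum(1 for p, y in pairs if not p and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new_f1: float, baseline_f1: float) -> float:
    """Gain expressed as a fraction of the baseline, not in absolute points."""
    return (new_f1 - baseline_f1) / baseline_f1
```

A relative improvement is measured against the baseline score, so it is larger than the corresponding absolute difference in F1 whenever the baseline is below 1.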
