Identifying Breakdowns in Conversational Recommender Systems using User Simulation

Nolwenn Bernard; Krisztian Balog

TL;DR

This work tackles the robustness gap in conversational recommender systems (CRSs) by proposing a simulator-driven methodology to identify conversational breakdowns. The approach defines breakdown types, detectors, and a four-step workflow that analyzes $N$ CRS–user-simulator conversations to reveal problematic dialogue paths, enabling iterative CRS improvements. The authors demonstrate the method in a case study with IAI MovieBot and a user simulator, showing that targeted modifications can eliminate system failures and reduce other breakdown types, while also acknowledging that the user simulator itself can introduce breakdowns. The methodology is architecture-agnostic and serves as both a diagnostic tool and a development workflow for strengthening CRS robustness and evaluability.

Abstract

We present a methodology to systematically test conversational recommender systems with regard to conversational breakdowns. It involves examining conversations generated between the system and simulated users for a set of pre-defined breakdown types, extracting the responsible conversational paths, and characterizing them in terms of the underlying dialogue intents. User simulation offers the advantages of simplicity, cost-effectiveness, and time efficiency for obtaining conversations in which potential breakdowns can be identified. The proposed methodology can be used as a diagnostic tool as well as a development tool to improve conversational recommender systems. We apply our methodology in a case study with an existing conversational recommender system and user simulator, demonstrating that with just a few iterations, we can make the system more robust to conversational breakdowns.
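As an illustration of the pipeline described in the abstract (scan simulated conversations for pre-defined breakdown types, extract the responsible paths, and describe them by dialogue intent), here is a minimal sketch. All of it is an assumption for illustration: the `Turn` structure, the intent labels, the `repeated_agent_intent` toy detector, and the window-based path extraction are hypothetical and are not the detectors or definitions used by the authors.

```python
from collections import Counter
from typing import Callable, Dict, List, NamedTuple, Tuple


class Turn(NamedTuple):
    speaker: str  # "AGENT" or "USER" (hypothetical labels)
    intent: str   # dialogue intent label (hypothetical label set)


Conversation = List[Turn]
# A detector returns the indices of turns where it suspects a breakdown.
Detector = Callable[[Conversation], List[int]]


def repeated_agent_intent(conv: Conversation, min_repeats: int = 3) -> List[int]:
    """Toy detector (an assumption, not the paper's definition): flag the turn
    at which the agent has produced the same intent `min_repeats` times in a
    row, counting agent turns only."""
    flagged: List[int] = []
    streak, last = 0, None
    for i, turn in enumerate(conv):
        if turn.speaker != "AGENT":
            continue
        streak = streak + 1 if turn.intent == last else 1
        last = turn.intent
        if streak == min_repeats:
            flagged.append(i)
    return flagged


def extract_path(conv: Conversation, index: int, window: int = 2) -> Tuple[str, ...]:
    """Return the intent path made of the flagged turn and up to `window`
    turns before it."""
    start = max(0, index - window)
    return tuple(f"{t.speaker}:{t.intent}" for t in conv[start:index + 1])


def find_breakdown_paths(
    conversations: List[Conversation],
    detectors: Dict[str, Detector],
    window: int = 2,
) -> Dict[str, Counter]:
    """Count, per breakdown type, the intent paths that led to a flagged turn."""
    paths: Dict[str, Counter] = {name: Counter() for name in detectors}
    for conv in conversations:
        for name, detect in detectors.items():
            for idx in detect(conv):
                paths[name][extract_path(conv, idx, window)] += 1
    return paths


# Usage: one simulated conversation in which the agent keeps eliciting.
conv = [
    Turn("USER", "DISCLOSE"), Turn("AGENT", "ELICIT"),
    Turn("USER", "DISCLOSE"), Turn("AGENT", "ELICIT"),
    Turn("USER", "DISCLOSE"), Turn("AGENT", "ELICIT"),
]
result = find_breakdown_paths([conv], {"repeated_intent": repeated_agent_intent})
```

Ranking the most frequent paths per breakdown type is one plausible way to decide which dialogue paths to inspect first; the actual breakdown types and their operationalization are given in the paper's own sections.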

Paper Structure (13 sections, 4 figures, 2 tables, 2 algorithms)

Introduction
Related Work
Detecting Conversational Breakdowns Using User Simulation
Problem Statement
Methodology
Conversational Breakdown Detection
Operationalization
System failure
Dialogue of the deaf
Conversation flow discontinuation
Case Study
Discussion
Conclusion

Figures (4)

Figure 1: Flowchart of the proposed methodology, with dashed arrows denoting transitions when the methodology is used as a development tool.
Figure 3: Simplified interaction model for conversational recommendation (inspired by Habib:2020:CIKM). The blue and green states represent the agent and user, respectively.
Figure 4: Architecture of dialogue participants: IAI MovieBot and UserSimCRS.
Figure 5: Conversational breakdowns per type for each iteration (groups). $B_1$, $B_2$, and $B_3$ represent system failure, dialogue of the deaf, and flow discontinuation, respectively. Colors indicate whether the breakdowns can be attributed to the CRS (purple), the US (green), or cannot be attributed (blue). Our aim in this work is to decrease the number of breakdowns, especially those attributed to the CRS. The arrows between the iterations indicate which participant (CRS/US) was modified and which breakdown was targeted.
