How Reliable is Your Simulator? Analysis on the Limitations of Current LLM-based User Simulators for Conversational Recommendation

Lixi Zhu; Xiaowen Huang; Jitao Sang

TL;DR

This paper investigates the reliability of LLM-based user simulators for conversational recommender systems (CRS) by analytically validating iEvaLM on two datasets, ReDial and OpenDialKG. It identifies three issues: data leakage, in both the conversational history and the simulator's replies, inflates evaluation results; CRS success depends more on the conversational history than on the simulator's responses; and controlling the simulator's output through a single prompt template is difficult. To address these limitations, the authors propose SimpleUserSim, which restricts the simulator to the attributes of the target item and introduces explicit, prompt-guided actions (Chit-chat, Ask, Recommend), reducing leakage and improving multi-turn CRS performance. The study concludes that LLMs hold promise for CRS, but that realistic evaluation requires trustworthy and controllable user simulators; SimpleUserSim is offered as a practical step in that direction.

Abstract

A Conversational Recommender System (CRS) interacts with users through natural language to understand their preferences and provide personalized recommendations in real time. CRS has demonstrated significant potential, which has made the development of more realistic and reliable user simulators a key research focus. Recently, the capabilities of Large Language Models (LLMs) have attracted considerable attention in various fields, and efforts are underway to construct user simulators based on them. While these works are innovative, they also have limitations that require attention. In this work, we analyze the limitations of using LLMs to construct user simulators for CRS in order to guide future research. To this end, we conduct an analytical validation of a notable work, iEvaLM. Through multiple experiments on two widely used conversational recommendation datasets, we highlight several issues with current evaluation methods for LLM-based user simulators: (1) Data leakage, which occurs in both the conversational history and the user simulator's replies, inflates evaluation results. (2) The success of CRS recommendations depends more on the availability and quality of the conversational history than on the responses from the user simulator. (3) Controlling the output of the user simulator through a single prompt template proves challenging. To overcome these limitations, we propose SimpleUserSim, which employs a straightforward strategy to guide the conversation toward the target items. Our study validates the ability of CRS models to utilize interaction information, which significantly improves the recommendation results.
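The leakage issue in point (1) can be made concrete with a small sketch. The code below is illustrative only and is not the authors' evaluation code: the `Dialogue` record, the case-insensitive substring test for leakage, and the choice to drop leaked dialogues before computing recall are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dialogue:
    """Hypothetical record of one simulated conversation."""
    target_item: str              # ground-truth item title
    history: List[str]            # utterances taken from the dataset
    simulator_replies: List[str]  # utterances produced by the user simulator
    recommended: List[str]        # ranked items returned by the CRS


def mentions(target: str, utterances: List[str]) -> bool:
    """Assumed leakage test: the target title appears verbatim in an utterance."""
    needle = target.lower()
    return any(needle in utterance.lower() for utterance in utterances)


def leakage_source(d: Dialogue) -> Optional[str]:
    """Return where the target title leaked from, or None if it did not leak."""
    if mentions(d.target_item, d.history):
        return "history"
    if mentions(d.target_item, d.simulator_replies):
        return "simulator"
    return None


def recall_at_k(dialogues: List[Dialogue], k: int, exclude_leaked: bool = False) -> float:
    """Fraction of dialogues whose target is in the top-k recommendations.

    With exclude_leaked=True, dialogues flagged by leakage_source are dropped
    first (one possible way to discount leakage-affected successes).
    """
    kept = [d for d in dialogues
            if not (exclude_leaked and leakage_source(d) is not None)]
    if not kept:
        return 0.0
    hits = sum(d.target_item in d.recommended[:k] for d in kept)
    return hits / len(kept)
```

Comparing `recall_at_k(dialogues, k)` with `recall_at_k(dialogues, k, exclude_leaked=True)` gives a rough sense of how much of a reported score could be attributed to the target title appearing in the conversation itself.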

Paper Structure (12 sections, 6 figures, 2 tables)

This paper contains 12 sections, 6 figures, 2 tables.

Introduction
LLM as User Simulator for CRS
Workflow
Experimental setup
Experiment
RQ1: Does the current user simulator, iEvaLM, exhibit data leakage, and if so, in which process? How does the model perform when we ignore these successful recommendations affected by data leakage?
RQ2: How much do successful recommendation conversations depend on user simulator interactions compared to conversational history?
RQ3: Can the user simulator generate responses that meet expectations across various dataset scenarios? If not, why?
The proposed simple user simulator
A very intuitive improvement
Experiment
Conclusion and Discussion

Figures (6)

Figure 1: Workflow of the User Simulator.
Figure 2: Data leakage from conversational history leads to successful recommendations.
Figure 3: Data leakage from user simulator leads to successful recommendations.
Figure 4: Percentage of successful recommendations by turn when using iEvaLM as the user simulator.
Figure 5: The proportion of the CRS's intents during the interaction.
...and 1 more figure
