Evaluating Large Language Models as Generative User Simulators for Conversational Recommendation

Se-eun Yoon; Zhankui He; Jessica Maria Echterhoff; Julian McAuley

TL;DR

The paper introduces a new protocol for measuring how accurately language models emulate human behavior in conversational recommendation. The protocol comprises five tasks, each designed to evaluate a key property that a synthetic user should exhibit.

Abstract

Synthetic users are cost-effective proxies for real users in the evaluation of conversational recommender systems. Large language models show promise in simulating human-like behavior, which raises the question of whether they can represent a diverse population of users. We introduce a new protocol to measure the degree to which language models can accurately emulate human behavior in conversational recommendation. This protocol consists of five tasks, each designed to evaluate a key property that a synthetic user should exhibit: choosing which items to talk about, expressing binary preferences, expressing open-ended preferences, requesting recommendations, and giving feedback. Through evaluation of baseline simulators, we demonstrate that these tasks effectively reveal deviations of language models from human behavior, and we offer insights on how to reduce those deviations with model selection and prompting strategies.

Paper Structure (18 sections, 1 equation, 14 figures, 8 tables)

Introduction
Evaluation Tasks
Methods
Experiments
Related Work
Conclusion
Appendix
Dataset statistics
Prompts
ItemsTalk
BinPref
OpenPref
RecRequest
Feedback
More results
...and 3 more sections

Figures (14)

Figure 1: To be successful user simulators for conversational recommendation, representing a population of users, LLMs must fulfill a variety of tasks.
Figure 2: Distribution of mentioned items (Reddit+IH). Items are sorted in descending frequency. Humans mention more diverse items (left) than simulators (right).
Figure 3: How well do simulators reflect human preferences? Most fail, except gpt-4 with pickiness (bottom right). The units for ratings and positive rates are different but included in the same plot to compare trends.
Figure 4: Sentiments in open-ended responses.
Figure 5: Diversity of requests per entropy level. Simulator requests are less diverse across all entropy levels.
...and 9 more figures
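
The caption of Figure 2 compares how diversely humans and simulators mention items. As a minimal sketch of one way such a comparison could be quantified, the snippet below computes the Shannon entropy of an item-mention distribution. This is an illustrative assumption, not necessarily the metric used in the paper; the function name `mention_entropy` and the toy mention lists are hypothetical and not taken from the paper's data.

```python
import math
from collections import Counter


def mention_entropy(mentions):
    """Shannon entropy (in bits) of the empirical distribution of mentioned items.

    Higher values mean mentions are spread over more items more evenly.
    """
    counts = Counter(mentions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy data for illustration only (not from the paper).
human_mentions = ["Alien", "Heat", "Up", "Coco", "Jaws", "Heat"]
simulator_mentions = ["Inception", "Inception", "Inception", "Heat", "Inception", "Heat"]

human_h = mention_entropy(human_mentions)
simulator_h = mention_entropy(simulator_mentions)
print(f"human entropy: {human_h:.3f} bits")
print(f"simulator entropy: {simulator_h:.3f} bits")
```

Under this sketch, a simulator that concentrates its mentions on a few popular items scores lower than a population that spreads mentions across many items, which is the qualitative pattern the caption describes.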
