UserSimCRS v2: Simulation-Based Evaluation for Conversational Recommender Systems
Nolwenn Bernard, Krisztian Balog
TL;DR
This paper addresses the problem of evaluating conversational recommender systems (CRSs) in a scalable and reproducible way. It introduces UserSimCRS v2, a significantly upgraded framework that combines an enhanced agenda-based user simulator with two LLM-based simulators, unified data formats, broader CRS integration, and an LLM-based evaluation utility. The framework supports multiple benchmark datasets (e.g., ReDial, INSPIRED, IARD) and connects to CRSs through a CRS Arena interface. A movie recommendation case study demonstrates these features and reveals substantial variability in results across simulators and datasets. Together, these contributions lower the barrier to simulation-based evaluation and enable more robust, multifaceted assessment of both CRSs and user simulators, supporting richer benchmarking and further research on user simulation.
Abstract
Resources for simulation-based evaluation of conversational recommender systems (CRSs) are scarce. The UserSimCRS toolkit was introduced to address this gap. In this work, we present UserSimCRS v2, a significant upgrade that aligns the toolkit with state-of-the-art research. Key extensions include an enhanced agenda-based user simulator, new large language model (LLM)-based simulators, integration with a wider range of CRSs and datasets, and new LLM-as-a-judge evaluation utilities. We demonstrate these extensions in a case study.
