UserSimCRS v2: Simulation-Based Evaluation for Conversational Recommender Systems

Nolwenn Bernard, Krisztian Balog

TL;DR

This paper tackles the challenge of evaluating conversational recommender systems with scalable, reproducible methods. It introduces UserSimCRS v2, a significantly upgraded framework that combines an enhanced agenda-based user simulator with two LLM-based simulators, unified data formats, broader CRS integration, and an LLM-based evaluation utility. The approach supports multiple benchmark datasets (e.g., ReDial, INSPIRED, IARD) and CRSs via a CRS Arena interface, demonstrated through a movie recommendation case study that reveals substantive variability across simulators and datasets. These contributions lower barriers to simulation-based evaluation and enable more robust, multifaceted assessment of CRSs and user models, paving the way for richer benchmarking and research directions in user simulation.

Abstract

Resources for simulation-based evaluation of conversational recommender systems (CRSs) are scarce. The UserSimCRS toolkit was introduced to address this gap. In this work, we present UserSimCRS v2, a significant upgrade aligning the toolkit with state-of-the-art research. Key extensions include an enhanced agenda-based user simulator, introduction of large language model-based simulators, integration for a wider range of CRSs and datasets, and new LLM-as-a-judge evaluation utilities. We demonstrate these extensions in a case study.

Paper Structure

This paper contains 22 sections, 1 equation, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Agenda-based (Top) and end-to-end (Bottom) architectures for user simulators. The dashed lines represent optional components and data flows.
  • Figure 2: Overview of UserSimCRS v2 architecture. Grey components are inherited from Afzali et al. (WSDM 2023), while hashed and purple components correspond to updated or added components, respectively.