Evaluating Conversational Recommender Systems via Large Language Models: A User-Centric Framework

Nuo Chen, Quanyu Dai, Xiaoyu Dong, Piaohong Wang, Qinglin Jia, Zhaocheng Du, Zhenhua Dong, Xiao-Ming Wu

TL;DR

This work tackles the challenge of evaluating conversational recommender systems in a holistic, user-centered way. It introduces the Conversational Recommendation Evaluator (CoRE), a two‑part framework that uses LLMs as per‑factor evaluators for twelve user experience factors and a four-agent debate to produce an overall score on a 0–100 scale. Through experiments on ReDial and OpenDialKG with four CRS models, CoRE demonstrates substantial alignment with human judgments and outperforms traditional rule‑based metrics in capturing overall CRS quality, while also revealing factors that LLMs may couple and areas where evaluators diverge. The study lays groundwork for scalable, human-aligned CRS evaluation and highlights future directions for improving evaluator diversity, reducing hallucinations, and generalizing to different domains.

Abstract

Conversational recommender systems (CRSs) integrate both recommendation and dialogue tasks, making their evaluation uniquely challenging. Existing approaches primarily assess CRS performance by separately evaluating item recommendation and dialogue management using rule-based metrics. However, these methods fail to capture the real human experience, and they do not support direct conclusions about the system's overall performance. As conversational recommender systems become increasingly vital in e-commerce, social media, and customer support, the ability to evaluate both recommendation accuracy and dialogue management quality with a single metric that authentically reflects user experience has become the principal challenge impeding progress in this field. In this work, we propose a user-centric evaluation framework for CRSs based on large language models (LLMs), namely the Conversational Recommendation Evaluator (CoRE). CoRE consists of two main components: (1) LLM-As-Evaluator. We comprehensively summarize 12 key factors influencing user experience in CRSs and directly use an LLM as an evaluator to assign a score to each factor. (2) Multi-Agent Debater. We design a multi-agent debate framework with four distinct roles (common user, domain expert, linguist, and HCI expert) to discuss and synthesize the 12 evaluation factors into a unified overall performance score. Furthermore, we apply the proposed framework to evaluate four CRSs on two benchmark datasets. The experimental results show that CoRE aligns well with human evaluation on most of the 12 factors and on the overall assessment. In particular, CoRE's overall evaluation scores demonstrate significantly better alignment with human feedback than existing rule-based metrics.

Paper Structure (54 sections, 8 equations, 15 figures, 38 tables)

This paper contains 54 sections, 8 equations, 15 figures, 38 tables.

Figures (15)

  • Figure 1: The overview of our method. We evaluate user interactions with conversational systems through a large language model across 12 factors. The scores and justifications for the twelve factors are used as input to the Multi-Agent Debater, where each agent receives inputs corresponding to the specific factors associated with its designated role. After several rounds of debate, the agents produce a final overall score.
  • Figure 2: An example of a user (U) interacting with a chit‑chat–style conversational recommender system (S). The CRS uses natural language to suggest movies to the user, while also presenting a separate list of recommended items at each turn. The items in that list do not necessarily match those mentioned in the conversation.
  • Figure 3: The first part of our proposed evaluation framework encompasses four dimensions, covering a total of twelve factors.
  • Figure 4: The interaction process between the user simulator and the CRS.
  • Figure 5: Turn-wise evolution of agreement ratios for four conversational recommendation models. Each subfigure shows the proportion of All Agree (all four evaluators assign the same score) and Major Agree (at least three evaluators assign the same score) judgments as a function of the number of turns.
  • ...and 10 more figures
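
The two agreement ratios named in the Figure 5 caption can be illustrated with a minimal sketch. It assumes each judged item carries exactly four evaluator scores, stored as integers. The function names `agreement_flags` and `agreement_ratios` are hypothetical and not taken from the paper, and grouping items by turn count is left out. Because Major Agree is defined as "at least three" matching scores, every All Agree item also counts as Major Agree here.

```python
from collections import Counter
from typing import Sequence


def agreement_flags(scores: Sequence[int]) -> tuple[bool, bool]:
    """Return (all_agree, major_agree) for one item's four evaluator scores.

    Assumption: scores are discrete values, one per evaluator.
    """
    if len(scores) != 4:
        raise ValueError("expected exactly four evaluator scores")
    # Size of the largest group of identical scores.
    top_count = Counter(scores).most_common(1)[0][1]
    return top_count == 4, top_count >= 3


def agreement_ratios(rows: Sequence[Sequence[int]]) -> tuple[float, float]:
    """Return (All Agree ratio, Major Agree ratio) over a set of items."""
    if not rows:
        raise ValueError("need at least one item")
    flags = [agreement_flags(row) for row in rows]
    n = len(flags)
    all_ratio = sum(all_agree for all_agree, _ in flags) / n
    major_ratio = sum(major_agree for _, major_agree in flags) / n
    return all_ratio, major_ratio


if __name__ == "__main__":
    # Made-up scores for illustration only; not data from the paper.
    example = [
        [3, 3, 3, 3],  # all four agree
        [3, 3, 3, 2],  # three agree
        [1, 2, 3, 3],  # only two agree
        [4, 4, 2, 2],  # two pairs, no majority of three
    ]
    print(agreement_ratios(example))
```

To reproduce a turn-wise curve like the one the caption describes, one would call `agreement_ratios` once per turn count on the items observed at that turn.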