Evaluating Conversational Recommender Systems via Large Language Models: A User-Centric Framework
Nuo Chen, Quanyu Dai, Xiaoyu Dong, Piaohong Wang, Qinglin Jia, Zhaocheng Du, Zhenhua Dong, Xiao-Ming Wu
TL;DR
This work tackles the challenge of evaluating conversational recommender systems (CRSs) in a holistic, user-centered way. It introduces the Conversational Recommendation Evaluator (CoRE), a two-part framework: LLMs first score each of twelve user-experience factors individually, and a four-agent debate then synthesizes those scores into an overall score on a 0–100 scale. In experiments on ReDial and OpenDialKG with four CRS models, CoRE aligns substantially with human judgments and outperforms traditional rule-based metrics in capturing overall CRS quality. The experiments also reveal factors that LLMs may couple and areas where evaluators diverge. The study lays groundwork for scalable, human-aligned CRS evaluation and highlights future directions: improving evaluator diversity, reducing hallucinations, and generalizing to different domains.
Abstract
Conversational recommender systems (CRSs) integrate both recommendation and dialogue tasks, making their evaluation uniquely challenging. Existing approaches primarily assess CRS performance by evaluating item recommendation and dialogue management separately, using rule-based metrics. However, these methods fail to capture the real human experience and do not support direct conclusions about a system's overall performance. As CRSs become increasingly vital in e-commerce, social media, and customer support, evaluating recommendation accuracy and dialogue management quality with a single metric that authentically reflects user experience has become the principal challenge impeding progress in this field. In this work, we propose Conversational Recommendation Evaluator (CoRE), a user-centric evaluation framework for CRSs based on large language models (LLMs). CoRE consists of two main components: (1) LLM-As-Evaluator: we comprehensively summarize 12 key factors influencing user experience in CRSs and directly use an LLM as an evaluator to assign a score to each factor. (2) Multi-Agent Debater: we design a multi-agent debate framework with four distinct roles (common user, domain expert, linguist, and HCI expert) that discuss the 12 evaluation factors and synthesize them into a unified overall performance score. We apply the proposed framework to evaluate four CRSs on two benchmark datasets. The experimental results show that CoRE aligns well with human evaluation on most of the 12 factors and on the overall assessment. In particular, CoRE's overall evaluation scores align significantly better with human feedback than existing rule-based metrics do.
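The two-stage structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The factor names are hypothetical placeholders, since the abstract does not list the 12 factors. The "LLM" is any callable that maps a prompt to a number. The number of debate rounds, the prompt wording, and the final averaging step are all assumptions made for illustration; the abstract does not say how the debate reaches its overall score.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical placeholders: the abstract says there are 12 factors
# but does not name them, so no real factor names are used here.
FACTORS: List[str] = [f"factor_{i}" for i in range(1, 13)]

# The four debater roles named in the abstract.
ROLES: List[str] = ["common user", "domain expert", "linguist", "HCI expert"]

# Assumption: an "LLM" is any callable mapping a prompt to a numeric score.
ScoreFn = Callable[[str], float]


def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def score_factors(dialogue: str, llm: ScoreFn) -> Dict[str, float]:
    """Stage 1 (LLM-As-Evaluator): ask for one score per factor."""
    return {f: llm(f"Rate '{f}' for this dialogue:\n{dialogue}") for f in FACTORS}


def debate_overall(factor_scores: Dict[str, float], llm: ScoreFn,
                   rounds: int = 2) -> float:
    """Stage 2 (Multi-Agent Debater): each role proposes an overall score,
    seeing the previous round's proposals, for a fixed number of rounds."""
    summary = ", ".join(f"{k}={v:.1f}" for k, v in factor_scores.items())
    proposals: Dict[str, float] = {}
    for _ in range(rounds):
        seen = ", ".join(f"{r}: {s:.0f}" for r, s in proposals.items()) or "none yet"
        proposals = {
            role: clamp(
                llm(f"As a {role}, given factor scores [{summary}] and prior "
                    f"proposals [{seen}], give an overall score from 0 to 100."),
                0.0, 100.0)
            for role in ROLES
        }
    # Assumption: a plain mean stands in for however the debate
    # actually synthesizes a final 0-100 score.
    return mean(proposals.values())


if __name__ == "__main__":
    stub_llm: ScoreFn = lambda prompt: 70.0  # deterministic stand-in for a model
    scores = score_factors("User: any sci-fi tips?\nSystem: try Inception.", stub_llm)
    print(len(scores), debate_overall(scores, stub_llm))
```

Swapping `stub_llm` for a function that calls a real model and parses a number from its reply would leave the rest of the sketch unchanged.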
