Concept -- An Evaluation Protocol on Conversational Recommender Systems with System-centric and User-centric Factors
Chen Huang, Peixin Qin, Yang Deng, Wenqiang Lei, Jiancheng Lv, Tat-Seng Chua
TL;DR
This paper introduces Concept, an inclusive evaluation protocol for conversational recommender systems (CRS) that unifies system- and user-centric factors into three characteristics (Recommendation Intelligence, Social Intelligence, and Personification) and six abilities, evaluated via an LLM-based user simulator and evaluator with fine-grained rubrics. By testing off-the-shelf CRS models on 6720 simulated conversations across ReDial and OpenDialKG, the study finds that the state-of-the-art CHATCRS excels in social cooperation and recommendation quality but suffers from identity-related and reliability issues, including hallucinations and deceptive explanations. The results demonstrate the value of a quantitative, rubric-driven evaluation framework for diagnosing strengths and risks in CRS behavior, and highlight the need for identity-aware, trustworthy, and socially aware systems. Concept thereby lays a foundation for more user-centric and ethically aligned CRS improvements and provides actionable guidance for researchers and practitioners. The work also discusses limitations, notably potential biases in LLM-based simulation and the need for higher-quality, more diverse datasets to further validate the protocol.
Abstract
The conversational recommendation system (CRS) has been criticized regarding its user experience in real-world scenarios, despite recent significant progress achieved in academia. Existing evaluation protocols for CRS may prioritize system-centric factors such as effectiveness and fluency in conversation while neglecting user-centric aspects. Thus, we propose a new and inclusive evaluation protocol, Concept, which integrates both system- and user-centric factors. We conceptualise three key characteristics in representing such factors and further divide them into six primary abilities. To implement Concept, we adopt an LLM-based user simulator and evaluator with scoring rubrics that are tailored for each primary ability. Our protocol, Concept, serves a dual purpose. First, it provides an overview of the pros and cons of current CRS models. Second, it pinpoints the problem of low usability in the "omnipotent" ChatGPT and offers a comprehensive reference guide for evaluating CRS, thereby setting the foundation for CRS improvement.
