FACE: A Fine-grained Reference Free Evaluator for Conversational Recommender Systems
Hideaki Joko, Faegheh Hasibi
TL;DR
This work tackles the challenge of evaluating Conversational Recommender Systems (CRSs) without gold references by introducing FACE, a Fine-grained, Aspect-based Conversation Evaluation method. FACE decomposes conversations into atomic conversation particles, evaluates each particle with optimized, diverse LLM instructions, and aggregates the scores into turn- and dialogue-level aspects, yielding interpretable diagnostics that help locate problems within a conversation. The method aligns strongly with human judgments (system-level correlation of about 0.9; turn- and dialogue-level correlation of about 0.5), generalizes across LLMs and domains, and significantly outperforms state-of-the-art baselines. The authors also release CRSArena-Eval, a multi-turn annotation dataset, demonstrate FACE’s interpretability through case studies, and analyze sample efficiency and biases, positioning FACE as a scalable, insightful companion to human evaluation in CRS development.
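The decompose, score, and aggregate flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the sentence-based `split_into_particles`, the plain averaging used for aggregation, and the keyword-based `toy_scorer` that stands in for an LLM call are all hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical scorer signature: (instruction, particle) -> score.
# In practice this would wrap an LLM call; here it is only a placeholder type.
Scorer = Callable[[str, str], float]


def split_into_particles(turn: str) -> List[str]:
    # Assumption: treat each period-delimited sentence as one "particle".
    return [p.strip() for p in turn.split(".") if p.strip()]


def score_turn(turn: str, instructions: List[str], scorer: Scorer) -> float:
    # Score every particle under every instruction, then average twice:
    # first across instructions, then across particles (assumed aggregation).
    particles = split_into_particles(turn)
    if not particles or not instructions:
        return 0.0
    per_particle = [
        mean([scorer(ins, p) for ins in instructions]) for p in particles
    ]
    return mean(per_particle)


def score_dialogue(
    turns: List[str], instructions: List[str], scorer: Scorer
) -> Dict[str, object]:
    # Turn-level scores are kept so low-scoring turns can be located later.
    turn_scores = [score_turn(t, instructions, scorer) for t in turns]
    dialogue_score = mean(turn_scores) if turn_scores else 0.0
    return {"turn_scores": turn_scores, "dialogue_score": dialogue_score}


def toy_scorer(instruction: str, particle: str) -> float:
    # Stand-in for an LLM judgment: 1.0 if the particle mentions "recommend".
    return 1.0 if "recommend" in particle.lower() else 0.0
```

Keeping the per-turn scores alongside the dialogue score is what makes it possible to point at the specific turn that drags a conversation down, which is the kind of diagnostic the summary refers to.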
Abstract
A systematic, reliable, and low-cost evaluation of Conversational Recommender Systems (CRSs) remains an open challenge. Existing automatic CRS evaluation methods have proven insufficient for evaluating the dynamic nature of recommendation conversations. This work proposes FACE: a Fine-grained, Aspect-based Conversation Evaluation method that provides evaluation scores for diverse turn- and dialogue-level qualities of recommendation conversations. FACE is reference-free and correlates strongly with human judgments, achieving a system-level correlation of 0.9 and turn- and dialogue-level correlations of 0.5, outperforming state-of-the-art CRS evaluation methods by a large margin. Additionally, unlike existing LLM-based methods that provide a single uninterpretable score, FACE provides insights into system performance and enables identifying and locating problems within conversations.
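The gap between the system-level and the turn- or dialogue-level correlation can be made concrete with a small sketch. It assumes Pearson correlation and per-system averaging; this section does not state which correlation coefficient the paper uses, and the record fields `system`, `metric`, and `human` are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    # Plain Pearson correlation; raises if either input has zero variance.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def item_level_correlation(records: List[Dict]) -> float:
    # Correlate metric and human scores over individual turns or dialogues.
    return pearson(
        [r["metric"] for r in records], [r["human"] for r in records]
    )


def system_level_correlation(records: List[Dict]) -> float:
    # Average scores per system first, then correlate the per-system means.
    by_system = defaultdict(list)
    for r in records:
        by_system[r["system"]].append(r)
    names = sorted(by_system)
    metric_means = [mean([r["metric"] for r in by_system[n]]) for n in names]
    human_means = [mean([r["human"] for r in by_system[n]]) for n in names]
    return pearson(metric_means, human_means)


# Made-up scores for three hypothetical systems, two items each.
records = [
    {"system": "A", "metric": 1, "human": 3},
    {"system": "A", "metric": 3, "human": 1},
    {"system": "B", "metric": 3, "human": 5},
    {"system": "B", "metric": 5, "human": 3},
    {"system": "C", "metric": 5, "human": 7},
    {"system": "C", "metric": 7, "human": 5},
]
```

On this toy data the metric ranks the three systems exactly as the human scores do, while disagreeing on individual items within each system, so averaging per system removes item-level noise and the system-level correlation comes out higher than the item-level one.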
