FACE: A Fine-grained Reference Free Evaluator for Conversational Recommender Systems

Hideaki Joko, Faegheh Hasibi

TL;DR

This work tackles the challenge of evaluating Conversational Recommender Systems (CRSs) without gold references by introducing FACE, a Fine-grained Aspect-based Conversation Evaluation method. FACE decomposes conversations into atomic conversation particles, evaluates each particle with optimized, diverse LLM instructions, and aggregates scores for turn- and dialogue-level aspects, yielding interpretable diagnostics and enabling pinpointed troubleshooting. The method achieves strong alignment with human judgments (system-level ~0.9; turn/dialogue-level ~0.5) and generalizes across LLMs and domains, significantly outperforming state-of-the-art baselines. The authors also release CRSArena-Eval, a multi-turn annotation dataset, and demonstrate FACE’s interpretability through case studies, alongside analyses of sample efficiency and biases, underscoring FACE as a scalable, insightful companion to human evaluation in CRS development.

Abstract

A systematic, reliable, and low-cost evaluation of Conversational Recommender Systems (CRSs) remains an open challenge. Existing automatic CRS evaluation methods have proven insufficient for evaluating the dynamic nature of recommendation conversations. This work proposes FACE: a Fine-grained, Aspect-based Conversation Evaluation method that provides evaluation scores for diverse turn- and dialogue-level qualities of recommendation conversations. FACE is reference-free and correlates strongly with human judgments, achieving a system-level correlation of 0.9 and a turn/dialogue-level correlation of 0.5, outperforming state-of-the-art CRS evaluation methods by a large margin. Additionally, unlike existing LLM-based methods that provide a single uninterpretable score, FACE provides insights into system performance and makes it possible to identify and locate problems within conversations.
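
The abstract reports correlation with human judgments at two granularities. The sketch below illustrates how the two can differ on the same data: dialogue-level correlation compares scores conversation by conversation, while system-level correlation first averages scores per system. This excerpt does not state which correlation coefficient the paper uses, so Pearson correlation is an assumption here, and the records, field names, and function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def dialogue_level_correlation(records):
    """Correlate metric and human scores over individual dialogues."""
    return pearson([r["metric"] for r in records],
                   [r["human"] for r in records])


def system_level_correlation(records):
    """Average scores per system first, then correlate the system means."""
    by_system = defaultdict(list)
    for r in records:
        by_system[r["system"]].append(r)
    metric_means, human_means = [], []
    for rows in by_system.values():
        metric_means.append(sum(r["metric"] for r in rows) / len(rows))
        human_means.append(sum(r["human"] for r in rows) / len(rows))
    return pearson(metric_means, human_means)


# Made-up scores: three hypothetical systems, two dialogues each.
records = [
    {"system": "A", "metric": 1, "human": 3},
    {"system": "A", "metric": 3, "human": 1},
    {"system": "B", "metric": 3, "human": 5},
    {"system": "B", "metric": 5, "human": 3},
    {"system": "C", "metric": 5, "human": 7},
    {"system": "C", "metric": 7, "human": 5},
]

system_corr = system_level_correlation(records)
dialogue_corr = dialogue_level_correlation(records)
print(f"system-level: {system_corr:.2f}, dialogue-level: {dialogue_corr:.2f}")
```

In this toy data the metric ranks the systems exactly as the human scores do, yet it disagrees with them on individual dialogues within each system, so the system-level value is higher than the dialogue-level one.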

Paper Structure

This paper contains 23 sections, 9 equations, 4 figures, 6 tables, 2 algorithms.

Figures (4)

  • Figure 1: Illustration of FACE for a turn-level aspect. Instruction optimization generates a set of diverse evaluation instructions for the given aspect (e.g., relevance) based on an initial prompt; example instructions appear in the paper's appendix. For evaluation, each conversation is decomposed into particles containing a dialogue act, a mention (text span from a system turn), and user feedback from the following turn. A response distribution is created for each instruction-particle pair, and a weighted sum of the scores is computed. The final score is obtained by aggregating the scores across all instructions and particles. For turn-level aspects, aggregation is performed per turn, while for dialogue-level aspects, scores of particles across the entire dialogue are aggregated.
  • Figure 2: Distribution of human annotation scores for seven aspects across nine systems in CRSArena-Eval. The seven aspects and nine systems are described in the paper's section on the annotation process.
  • Figure 3: Breakdown analysis comparing BARCOR and UniCRS. Left: FACE evaluation results for each aspect. Right: Scores of the Understanding aspect per system turn.
  • Figure 4: Sample efficiency of different evaluation methods for the overall aspect on CRSArena-Eval.
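
The scoring flow described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Particle` fields mirror the caption, but `toy_scorer` is a hypothetical stand-in for an LLM that returns a response distribution, and using a plain mean to aggregate across instructions and particles is an assumption (the caption only says the scores are aggregated).

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Particle:
    turn: int            # index of the system turn the mention comes from
    dialogue_act: str
    mention: str         # text span from the system turn
    user_feedback: str   # user utterance from the following turn


# Maps each candidate score label to its probability.
Distribution = Dict[int, float]
Scorer = Callable[[str, Particle], Distribution]


def expected_score(dist: Distribution) -> float:
    """Weighted sum of score labels under the response distribution."""
    total = sum(dist.values())
    return sum(label * prob for label, prob in dist.items()) / total


def particle_score(p: Particle, instructions: List[str], scorer: Scorer) -> float:
    """Aggregate over instructions (assumption: arithmetic mean)."""
    return mean([expected_score(scorer(ins, p)) for ins in instructions])


def turn_level_scores(particles, instructions, scorer) -> Dict[int, float]:
    """Aggregate particle scores separately for each system turn."""
    per_turn = defaultdict(list)
    for p in particles:
        per_turn[p.turn].append(particle_score(p, instructions, scorer))
    return {turn: mean(scores) for turn, scores in per_turn.items()}


def dialogue_level_score(particles, instructions, scorer) -> float:
    """Aggregate particle scores across the entire dialogue."""
    return mean([particle_score(p, instructions, scorer) for p in particles])


def toy_scorer(instruction: str, particle: Particle) -> Distribution:
    """Hypothetical stand-in for an LLM; ignores the instruction text."""
    positive = "love" in particle.user_feedback.lower()
    return {1: 0.1, 5: 0.9} if positive else {1: 0.8, 5: 0.2}


# Hypothetical instructions and particles for a relevance-like aspect.
instructions = ["Rate how relevant the mention is.",
                "Does the mention fit the user's request?"]
particles = [
    Particle(1, "recommend", "Interstellar", "I love sci-fi, tell me more."),
    Particle(1, "explain", "a space epic", "I love sci-fi, tell me more."),
    Particle(3, "recommend", "Gravity", "No, I have already seen that."),
]

turn_scores = turn_level_scores(particles, instructions, toy_scorer)
dialogue_score = dialogue_level_score(particles, instructions, toy_scorer)
print(turn_scores, round(dialogue_score, 2))
```

Keeping the per-particle scores before aggregation is what allows a low score to be traced back to a specific turn and mention, which matches the diagnostic use the summary describes.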