Large Language Models as Evaluators for Recommendation Explanations

Xiaoyu Zhang, Yishan Li, Jiayin Wang, Bowen Sun, Weizhi Ma, Peijie Sun, Min Zhang

TL;DR

This paper investigates whether large language models can serve as evaluators of recommendation explanation quality by measuring how well their judgments align with ground-truth user feedback under a three-level meta-evaluation framework. Using a MovieLens-derived dataset with 39 users and around 2,500 Chinese explanations, the study compares LLM-based assessments against user ratings, third-party annotations, and reference-based metrics such as BLEU and ROUGE. It finds that zero-shot GPT-4 can achieve evaluation accuracy competitive with traditional methods, that in-context learning and personalized prompts can further improve alignment, and that ensembles of diverse LLM evaluators enhance stability and accuracy. The work argues for LLM-based evaluators as reproducible and cost-effective tools for evaluating explanation quality in explainable recommendation, and provides practical guidance on prompt design and model ensembles.

Abstract

The explainability of recommender systems has attracted significant attention in academia and industry. Many efforts have been made toward explainable recommendations, yet evaluating the quality of the explanations remains a challenging and unresolved issue. In recent years, leveraging LLMs as evaluators has emerged as a promising avenue in Natural Language Processing tasks (e.g., sentiment classification, information extraction), as they demonstrate strong capabilities in instruction following and common-sense reasoning. However, evaluating recommendation explanatory texts is different from these NLP tasks, as its criteria are related to human perceptions and are usually subjective. In this paper, we investigate whether LLMs can serve as evaluators of recommendation explanations. To answer the question, we utilize real user feedback on explanations from previous work and additionally collect third-party annotations and LLM evaluations. We design and apply a 3-level meta-evaluation strategy to measure the correlation between evaluator labels and the ground truth provided by users. Our experiments reveal that LLMs, such as GPT4, can provide comparable evaluations with appropriate prompts and settings. We also provide further insights into combining human labels with the LLM evaluation process and utilizing ensembles of multiple heterogeneous LLM evaluators to enhance the accuracy and stability of evaluations. Our study verifies that utilizing LLMs as evaluators can be an accurate, reproducible and cost-effective solution for evaluating recommendation explanation texts. Our code is available at https://github.com/Xiaoyu-SZ/LLMasEvaluator.
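
The core meta-evaluation idea, correlating an evaluator's labels with the ground truth provided by users, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout `(user, user_score, evaluator_score)`, all function names, and the two granularities shown (pooled over the whole dataset, and averaged per user) are hypothetical, and the paper's three levels are not defined in this summary. Pearson correlation is used because the figure captions mention it. The item-wise averaging in `ensemble` is likewise just one simple, assumed way of combining several evaluators.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns None when it is undefined
    (fewer than two points, or zero variance on either side)."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)


def dataset_level(records):
    """Assumed granularity: pool every record into one correlation."""
    return pearson([r[1] for r in records], [r[2] for r in records])


def user_level(records):
    """Assumed granularity: correlate within each user, then average
    over the users for whom the correlation is defined."""
    by_user = defaultdict(list)
    for user, user_score, evaluator_score in records:
        by_user[user].append((user_score, evaluator_score))
    values = []
    for pairs in by_user.values():
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if r is not None:
            values.append(r)
    return sum(values) / len(values) if values else None


def ensemble(score_lists):
    """Assumed ensemble rule: average scores item-wise across evaluators."""
    return [sum(col) / len(col) for col in zip(*score_lists)]


# Toy data (invented for illustration only).
records = [
    ("u1", 1, 2), ("u1", 3, 4), ("u1", 5, 6),
    ("u2", 2, 4), ("u2", 4, 2),
]
print("dataset-level:", dataset_level(records))
print("user-level:", user_level(records))
print("ensemble:", ensemble([[1, 2, 3], [3, 4, 5]]))
```

Separating the pooled and per-user views matters because an evaluator can track the overall score distribution well while still disagreeing with individual users, which is one reason a subjective task like explanation quality is checked at more than one granularity.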

Paper Structure

This paper contains 17 sections, 6 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: Traditional evaluation approaches vs. utilizing LLMs for evaluations.
  • Figure 2: The outline of evaluation prompt templates applied in our study.
  • Figure 3: Comparison of 3-level Pearson correlations for zero-shot, (non-personalized) one-shot, and personalized one-shot learning on GPT4(M) and Qwen(M). (M) denotes the multiple-aspect version.
  • Figure 4: Distribution of evaluation accuracy from ensemble results of #N LLM evaluators. Values on the x-axis denote #N.