Automated Evaluation of Personalized Text Generation using Large Language Models

Yaqing Wang; Jiepu Jiang; Mingyang Zhang; Cheng Li; Yi Liang; Qiaozhu Mei; Michael Bendersky

TL;DR

The paper addresses the challenge of evaluating personalized text generation, where traditional metrics fail to capture user-specific nuances. It proposes AuPEL, a novel evaluation framework that leverages large language models to automatically assess three semantic aspects of generated text: personalization, quality, and relevance. Through carefully controlled experiments, AuPEL's judgments are compared against human annotations, demonstrating higher accuracy, consistency, and efficiency than conventional text-similarity metrics in ranking models by personalization capability. The work argues for adopting LLM-based evaluation in personalization contexts while acknowledging remaining methodological challenges.

Abstract

Personalized text generation presents a specialized mechanism for delivering content that is specific to a user's personal context. While the research progress in this area has been rapid, evaluation still presents a challenge. Traditional automated metrics such as BLEU and ROUGE primarily measure lexical similarity to human-written references, and are not able to distinguish personalization from other subtle semantic aspects, thus falling short of capturing the nuances of personalized generated content quality. On the other hand, human judgments are costly to obtain, especially in the realm of personalized evaluation. Inspired by these challenges, we explore the use of large language models (LLMs) for evaluating personalized text generation, and examine their ability to understand nuanced user context. We present AuPEL, a novel evaluation method that distills three major semantic aspects of the generated text: personalization, quality, and relevance, and automatically measures these aspects. To validate the effectiveness of AuPEL, we design carefully controlled experiments and compare the accuracy of the evaluation judgments made by LLMs with that of judgments made by human annotators, and conduct rigorous analyses of the consistency and sensitivity of the proposed metric. We find that, compared to existing evaluation metrics, AuPEL not only distinguishes and ranks models based on their personalization abilities more accurately, but also presents commendable consistency and efficiency for this task. Our work suggests that using LLMs as the evaluators of personalized text generation is superior to traditional text similarity metrics, even though interesting new challenges still remain.
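
The abstract mentions comparing evaluator judgments against human annotations and analyzing consistency, but it does not say how either is computed. The sketch below is one plausible way to measure both for a pairwise evaluator, and it is not the paper's procedure. The pairwise framing, the `Judge` signature, the function names, and the toy `longer_text_judge` are all hypothetical; in practice the judge would wrap an LLM call.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (an assumption, not taken from the paper):
# given an aspect name and two candidate texts, return "A" or "B".
Judge = Callable[[str, str, str], str]

# The three aspects named in the abstract.
ASPECTS = ("personalization", "quality", "relevance")


def agreement_with_humans(
    judge: Judge,
    pairs: Sequence[Tuple[str, str]],
    human_labels: Sequence[str],
    aspect: str,
) -> float:
    """Fraction of pairs where the judge picks the same winner as the human."""
    if not pairs or len(pairs) != len(human_labels):
        raise ValueError("pairs and human_labels must be non-empty and equal length")
    hits = sum(
        judge(aspect, a, b) == label for (a, b), label in zip(pairs, human_labels)
    )
    return hits / len(pairs)


def order_consistency(
    judge: Judge, pairs: Sequence[Tuple[str, str]], aspect: str
) -> float:
    """Fraction of pairs whose winner is unchanged when presentation order is swapped."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    consistent = 0
    for a, b in pairs:
        first = judge(aspect, a, b)
        swapped = judge(aspect, b, a)
        # The same text wins both times only if the returned label flips.
        if first != swapped:
            consistent += 1
    return consistent / len(pairs)


def longer_text_judge(aspect: str, text_a: str, text_b: str) -> str:
    """Toy stand-in for an LLM judge: prefers the longer text, ties go to "A"."""
    return "A" if len(text_a) >= len(text_b) else "B"


if __name__ == "__main__":
    demo_pairs = [
        ("short", "a longer text"),
        ("much longer text here", "tiny"),
        ("same", "size"),
    ]
    demo_labels = ["B", "A", "B"]  # made-up human preferences for illustration
    for aspect in ASPECTS:
        acc = agreement_with_humans(longer_text_judge, demo_pairs, demo_labels, aspect)
        cons = order_consistency(longer_text_judge, demo_pairs, aspect)
        print(f"{aspect}: agreement={acc:.2f}, order consistency={cons:.2f}")
```

The toy judge resolves ties by position, so the equal-length pair fails the order-swap check. Position bias of this kind is exactly what a consistency analysis is meant to expose.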
