Evaluating Style-Personalized Text Generation: Challenges and Directions
Anubhav Jangra, Bahareh Sarrafzadeh, Silviu Cucerzan, Adrian de Wynter, Sujay Kumar Jauhar
TL;DR
This work critically evaluates how style-personalized text generation (SPTG) is measured, introducing a style-discrimination benchmark across eight datasets and three evaluation settings (domain, authorship, and LLM personalization). It compares traditional n-gram metrics, style-embedding approaches, and LLM-based judges, and finds that single metrics often underperform and correlate only weakly with human judgments, especially in LLM-driven scenarios. Ensembles of diverse metrics consistently outperform individual evaluators, with gains from both majority voting and performance-weighted voting across settings. The study also offers insights into prompting strategies for LLM judges, the limitations of LLM-based evaluation, and the substantial variability in human agreement, arguing for more robust, pragmatic metrics tailored to SPTG rather than dataset-driven benchmarks.
Abstract
With the surge of large language models (LLMs) and their ability to produce customized output, style-personalized text generation ("write like me") has become a rapidly growing area of interest. However, style personalization is highly specific, relative to every user, and depends strongly on the pragmatic context, which makes it uniquely challenging. Although prior research has introduced benchmarks and metrics for this area, they tend to be non-standardized and have known limitations (e.g., poor correlation with human judgments). Given that LLMs have been found not to capture author-specific style well, it follows that the metrics themselves must be scrutinized carefully. In this work we critically examine the effectiveness of the most common metrics used in the field, such as BLEU, embeddings, and LLMs-as-judges. We evaluate these metrics using our proposed style discrimination benchmark, which spans eight diverse writing tasks across three evaluation settings: domain discrimination, authorship attribution, and LLM-generated personalized vs non-personalized discrimination. We find strong evidence that employing ensembles of diverse evaluation metrics consistently outperforms single-evaluator methods, and conclude by providing guidance on how to reliably assess style-personalized text generation.
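To make the ensemble idea concrete, the following is a minimal sketch of the two voting schemes named above (majority voting and performance-weighted voting), not the authors' implementation. It assumes each evaluator casts one vote for the candidate text it judges closest in style to a reference. The evaluator names, the candidate labels, and the use of a per-evaluator validation accuracy as the weight are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Mapping


def majority_vote(votes: Mapping[str, Hashable]) -> Hashable:
    """Return the label picked by the most evaluators.

    Ties go to the label that was seen first (an arbitrary choice).
    """
    counts = Counter(votes.values())
    return counts.most_common(1)[0][0]


def weighted_vote(votes: Mapping[str, Hashable],
                  weights: Mapping[str, float]) -> Hashable:
    """Return the label with the largest total evaluator weight.

    Each evaluator adds its weight to the label it voted for.
    Evaluators without a weight contribute 0.0.
    """
    totals: Dict[Hashable, float] = {}
    for evaluator, label in votes.items():
        totals[label] = totals.get(label, 0.0) + weights.get(evaluator, 0.0)
    return max(totals, key=totals.get)


# Hypothetical votes: which candidate ("A" or "B") matches the reference style.
votes = {"bleu": "A", "style_embedding": "B", "llm_judge": "B"}

# Hypothetical weights, e.g. each evaluator's accuracy on a held-out split.
weights = {"bleu": 0.9, "style_embedding": 0.3, "llm_judge": 0.4}

majority_choice = majority_vote(votes)
weighted_choice = weighted_vote(votes, weights)
```

With these illustrative numbers the two schemes disagree: two of three evaluators pick "B", but the single evaluator voting "A" carries more weight than the other two combined. How weights are actually derived is a design choice this sketch leaves open.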
