Evaluating Style-Personalized Text Generation: Challenges and Directions

Anubhav Jangra, Bahareh Sarrafzadeh, Silviu Cucerzan, Adrian de Wynter, Sujay Kumar Jauhar

TL;DR

This work critically evaluates how style-personalized text generation (SPTG) is measured, introducing a style-discrimination benchmark across eight datasets and three evaluation settings (domain, authorship, and LLM personalization). It compares traditional n-gram metrics, style-embedding approaches, and LLM-based judges, and finds that single metrics often underperform and that correlations with human judgments are weak, especially in LLM-driven scenarios. The study demonstrates that ensembles of diverse metrics consistently outperform individual evaluators, with gains from both majority voting and performance-weighted voting across settings. It also reports findings on prompting strategies for LLM judges, the limitations of LLM-based evaluation, and substantial variability in human agreement, and argues for more robust, pragmatic metrics tailored to SPTG rather than dataset-driven benchmarks.

Abstract

With the surge of large language models (LLMs) and their ability to produce customized output, style-personalized text generation--"write like me"--has become a rapidly growing area of interest. However, style personalization is highly specific, relative to every user, and depends strongly on the pragmatic context, which makes it uniquely challenging. Although prior research has introduced benchmarks and metrics for this area, they tend to be non-standardized and have known limitations (e.g., poor correlation with human subjects). Because LLMs have been found not to capture author-specific style well, it follows that the metrics themselves must be scrutinized carefully. In this work we critically examine the effectiveness of the most common metrics used in the field, such as BLEU, embeddings, and LLMs-as-judges. We evaluate these metrics using our proposed style discrimination benchmark, which spans eight diverse writing tasks across three evaluation settings: domain discrimination, authorship attribution, and LLM-generated personalized vs non-personalized discrimination. We find strong evidence that employing ensembles of diverse evaluation metrics consistently outperforms single-evaluator methods, and conclude by providing guidance on how to reliably assess style-personalized text generation.
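The ensemble idea in the abstract can be illustrated with a minimal sketch. It assumes each metric casts a binary vote on a discrimination instance (+1 if it judges the positive candidate closer in style to the reference, -1 otherwise) and that performance weights are per-metric accuracies measured on some held-out data. The function names, metric names, and weight values are hypothetical, and the paper's exact weighting scheme may differ.

```python
from typing import Mapping


def majority_vote(votes: Mapping[str, int]) -> int:
    """Combine +1/-1 votes with equal weight. Returns +1, -1, or 0 on a tie."""
    total = sum(votes.values())
    return (total > 0) - (total < 0)


def performance_weighted_vote(votes: Mapping[str, int],
                              weights: Mapping[str, float]) -> int:
    """Combine +1/-1 votes, scaling each by an assumed per-metric weight
    (here: a held-out accuracy). Returns +1, -1, or 0 on a tie."""
    total = sum(weights[name] * vote for name, vote in votes.items())
    return (total > 0) - (total < 0)


# Hypothetical votes from three metric families on one instance.
votes = {"bleu": -1, "style_embedding": 1, "llm_judge": 1}

# Hypothetical weights: one strong metric can outvote two weak ones.
weights = {"bleu": 0.9, "style_embedding": 0.3, "llm_judge": 0.3}

print(majority_vote(votes))
print(performance_weighted_vote(votes, weights))
```

In this toy case the two schemes disagree: the unweighted majority sides with the two agreeing metrics, while the weighted vote follows the single metric with the highest assumed accuracy.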

Paper Structure

This paper contains 32 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Pairwise disagreement of evaluation metrics for the authorship attribution evaluation setting. Metrics from different evaluation paradigms disagree more than metrics within the same paradigm. For example, although ROUGE-1 and StyleDistance achieve the same overall performance score of 0.722 (see Table \ref{tab:base_results}), their disagreement score is 0.35, compared to the lower disagreement scores of ROUGE-1 and BLEU (0.22) and of StyleDistance and Wegmann (0.28).
  • Figure 2: Average pairwise BERTScore (Zhang et al., 2019) similarity across $T_{ref}$, $T_+$, and $T_-$. Semantic overlap between the candidate sets $T_+$ and $T_-$ is significantly higher for the LLM setting than for the other two.
  • Figure 3: Accuracy of $\rho_{all}(PWV)$ across all domains. Enron emails and Reddit microblogs achieve the lowest accuracy for the $AA$ and $LLM$ evaluation settings, only marginally outperforming the Random baseline.
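
The pairwise disagreement reported in Figure 1 can be sketched as follows. The captions do not define the score, so this sketch assumes it is the fraction of benchmark instances on which two metrics make different decisions. The function name, metric names, and decision vectors are hypothetical.

```python
from itertools import combinations
from typing import Dict, Mapping, Sequence, Tuple


def pairwise_disagreement(
    decisions: Mapping[str, Sequence[int]],
) -> Dict[Tuple[str, str], float]:
    """For every pair of metrics, return the fraction of instances on which
    their decisions differ (assumed definition of disagreement)."""
    scores: Dict[Tuple[str, str], float] = {}
    for a, b in combinations(sorted(decisions), 2):
        da, db = decisions[a], decisions[b]
        if len(da) != len(db) or len(da) == 0:
            raise ValueError("decision lists must be non-empty and equal in length")
        scores[(a, b)] = sum(x != y for x, y in zip(da, db)) / len(da)
    return scores


# Hypothetical per-instance decisions (1 = picked the positive candidate).
decisions = {
    "rouge1": [1, 1, 0, 1],
    "bleu": [1, 1, 0, 0],
    "style_embedding": [0, 1, 1, 1],
}

for pair, score in pairwise_disagreement(decisions).items():
    print(pair, score)
```

Two metrics can have identical accuracy and still disagree on many instances, which is the property that makes combining them in an ensemble useful.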