PersoBench: Benchmarking Personalized Response Generation in Large Language Models
Saleh Afzoon, Zahra Jamali, Usman Naseem, Amin Beheshti
TL;DR
PersoBench addresses the underexplored problem of evaluating personalized response generation in LLM-driven dialogues by introducing an automated zero-shot benchmarking pipeline with structured prompts, speaker labeling, and eight multi-dimensional metrics. The generation task is formalized as $P(r \mid C, P; \theta) = \prod_{t=1}^{T} P(r_t \mid r_{1:t-1}, C, P; \theta)$, where $r$ is the $T$-token response, $C$ the conversation context, the conditioning variable $P$ the persona, and $\theta$ the model parameters. The framework evaluates eight LLMs (four open-source, four closed-source) across three persona datasets under vanilla and Chain-of-Thought (CoT) prompting. Empirical results show that while LLMs produce fluent and diverse responses, they struggle to deliver coherent and persona-consistent outputs, and the benefit of CoT prompting varies with context and model. PersoBench provides a reproducible baseline for multi-faceted personalization evaluation and contributes a public benchmark and results to support future improvements in personalized dialogue systems.
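The factorization above can be illustrated numerically. The following is a minimal sketch, not part of PersoBench: it assumes $r$ is the response, $C$ the context, and the conditioning $P$ the persona, and it uses made-up per-token conditional probabilities. The helper name `sequence_log_prob` is hypothetical.

```python
import math


def sequence_log_prob(token_probs):
    """Sum the per-token conditional log-probabilities.

    Each entry stands for P(r_t | r_{1:t-1}, C, P; theta), so the sum of
    logs equals the log of the product in the factorization.
    """
    return sum(math.log(p) for p in token_probs)


# Hypothetical conditional probabilities for a 3-token response (T = 3).
token_probs = [0.5, 0.4, 0.25]

log_p = sequence_log_prob(token_probs)
prob = math.exp(log_p)  # product of the three conditionals

print(f"log P(r | C, P) = {log_p:.4f}")
print(f"P(r | C, P)     = {prob:.4f}")
```

Working in log space is a common design choice here, since multiplying many small probabilities directly tends to underflow for long responses.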
Abstract
While large language models (LLMs) have exhibited impressive conversational capabilities, their proficiency in delivering personalized responses remains unclear. Although recent benchmarks automatically evaluate persona consistency in role-playing contexts using LLM-based judgment, the evaluation of personalization in response generation remains underexplored. To address this gap, we present an automated benchmarking pipeline, PersoBench, to evaluate the personalization ability of LLMs in persona-aware dialogue generation within a zero-shot setting. Our framework employs a structured pipeline comprising speaker-aware annotation, task-specific and context-driven prompt construction, response post-processing, and automated evaluation across multiple dimensions of generation quality. Concretely, the pipeline preprocesses the text and labels speakers, constructs structured prompts with task instructions and LLM roles, validates the response format, and evaluates valid outputs for fluency, personalization, diversity, and coherence. We assess the performance of four open-source and four closed-source LLMs using well-known datasets and a range of explicit metrics. Our findings reveal that while LLMs excel at generating fluent and diverse responses, they fall well short of delivering personalized and coherent responses that account for both the conversation context and the provided personas.
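The pipeline stages named in the abstract (speaker labeling, structured prompt construction, response format validation) can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: the speaker names, the prompt wording, the JSON output format with a `response` key, and all function names are hypothetical choices made for this example.

```python
import json

# Assumed speaker labels; the source does not specify the exact labels used.
SPEAKERS = ("User 1", "User 2")


def label_speakers(turns):
    """Prefix each turn with a speaker label, assuming strict alternation."""
    return [f"{SPEAKERS[i % 2]}: {t.strip()}" for i, t in enumerate(turns)]


def build_prompt(persona, turns):
    """Assemble a zero-shot prompt from role, task, persona, and context."""
    next_speaker = SPEAKERS[len(turns) % 2]
    persona_block = "\n".join(f"- {p}" for p in persona)
    context_block = "\n".join(label_speakers(turns))
    return (
        "Role: You are a participant in a conversation.\n"
        f"Task: Write the next turn for {next_speaker}, consistent with "
        "the persona and the conversation so far.\n"
        'Output format (assumed): {"response": "<text>"}\n\n'
        f"Persona of {next_speaker}:\n{persona_block}\n\n"
        f"Conversation:\n{context_block}\n"
    )


def validate_response(raw):
    """Return the response text if the output matches the assumed format."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict):
        text = data.get("response")
        if isinstance(text, str) and text.strip():
            return text.strip()
    return None


prompt = build_prompt(
    persona=["I love hiking."],
    turns=["Hi there!", "Hello, how are you?"],
)
print(prompt)
print(validate_response('{"response": "I am great!"}'))
```

Only outputs that pass `validate_response` would move on to metric computation in such a design, which mirrors the abstract's statement that valid outputs are the ones evaluated.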
