PersoBench: Benchmarking Personalized Response Generation in Large Language Models

Saleh Afzoon; Zahra Jamali; Usman Naseem; Amin Beheshti

TL;DR

PersoBench addresses the under-explored problem of evaluating personalized response generation in LLM-driven dialogues by introducing an automated zero-shot benchmarking pipeline with structured prompts, speaker labeling, and eight multi-dimensional metrics. The generation task is formalized as $P(r \mid C, P; \theta) = \prod_{t=1}^{T} P(r_t \mid r_{1:t-1}, C, P; \theta)$, and the framework evaluates eight LLMs (four open-source, four closed-source) across three persona datasets under vanilla and Chain-of-Thought prompting. Empirical results show that while LLMs produce fluent and diverse responses, they struggle to deliver coherent and persona-consistent outputs, with CoT prompting offering varying benefits depending on context and model. PersoBench provides a reproducible baseline for multi-faceted personalization evaluation and contributes a public benchmark and results for future improvements in personalized dialogue systems.
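The factorization above says that the probability of a full response is the product of the per-token conditional probabilities, which in log space becomes a sum. A minimal sketch of that arithmetic follows; the token probabilities are made-up illustrative values, not outputs of any model in the benchmark, and the function name is hypothetical.

```python
import math


def sequence_log_prob(token_probs):
    """Sum per-token log-probabilities (the log of the product in the factorization)."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical values of P(r_t | r_{1:t-1}, C, P; theta) for a 3-token response.
token_probs = [0.5, 0.25, 0.8]

log_p = sequence_log_prob(token_probs)
response_prob = math.exp(log_p)  # equals 0.5 * 0.25 * 0.8
```

Working in log space avoids numerical underflow when T is large, which is why sequence likelihoods are usually accumulated this way.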

Abstract

While large language models (LLMs) have exhibited impressive conversational capabilities, their proficiency in delivering personalized responses remains unclear. Although recent benchmarks automatically evaluate persona consistency in role-playing contexts using LLM-based judgment, the evaluation of personalization in response generation remains underexplored. To address this gap, we present an automated benchmarking pipeline, PersoBench, to evaluate the personalization ability of LLMs in persona-aware dialogue generation within a zero-shot setting. Our framework employs a structured pipeline comprising speaker-aware annotation, task-specific and context-driven prompt construction, response post-processing, and automated evaluation across multiple dimensions of generation quality. In particular, the pipeline performs text preprocessing and speaker labeling, constructs structured prompts with task instructions and LLM roles, validates response format, and evaluates valid outputs across fluency, personalization, diversity, and coherence. We assess the performance of four open-source and four closed-source LLMs using well-known datasets and a range of explicit metrics. Our findings reveal that while LLMs excel at generating fluent and diverse responses, they are far from satisfactory in delivering personalized and coherent responses, considering both the conversation context and the provided personas.
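The pipeline stages named in the abstract (speaker labeling, structured prompt construction, response format validation) could be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: the speaker names, the instruction wording, the JSON output format, and all function names are hypothetical.

```python
import json


def label_speakers(turns, speakers=("User", "Assistant")):
    """Prefix each dialogue turn with an alternating speaker label."""
    return [f"{speakers[i % 2]}: {t.strip()}" for i, t in enumerate(turns)]


def build_prompt(persona, turns):
    """Assemble role, task instruction, persona, and labeled context into one prompt."""
    lines = [
        "You are a dialogue assistant.",  # assumed LLM role
        "Task: write the next response, consistent with the persona and context.",
        'Return JSON of the form {"response": "..."}.',  # assumed output format
        "Persona:",
        *[f"- {p}" for p in persona],
        "Context:",
        *label_speakers(turns),
    ]
    return "\n".join(lines)


def validate_response(raw):
    """Return the response text if the output matches the assumed format, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    text = data.get("response")
    return text if isinstance(text, str) and text.strip() else None
```

Only outputs that pass validation would go on to metric scoring; counting the rejected ones gives a failure ratio like the one reported in Figure 3.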

Paper Structure (17 sections, 1 equation, 3 figures, 10 tables)

Introduction
Related Work
Persona-aware Response Generation
LLM Evaluation Approaches
LLM Evaluation Frameworks
PersoBench
Problem statement
Overview of PersoBench
Prompt Development
Experiment
Implementation setup
Results and Analysis
Discussion
Conclusion and Future Work
Explicit prompt samples
...and 2 more sections

Figures (3)

Figure 1: Overview of the PersoBench automatic personalization benchmarking pipeline.
Figure 2: Performance analysis of (a) open-source LLMs in vanilla setting, (b) closed-source LLMs in vanilla setting, (c) open-source LLMs in CoT setting, (d) closed-source LLMs in CoT setting on the FoCus dataset.
Figure 3: Analysis of response time, failure ratio, and engagingness of LLMs on the FoCus dataset: (a) Vanilla and (b) CoT setups.
