A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations

Li Li; Peilin Cai; Ryan A. Rossi; Franck Dernoncourt; Branislav Kveton; Junda Wu; Tong Yu; Linxin Song; Tiankai Yang; Yuehan Qin; Nesreen K. Ahmed; Samyadeep Basu; Subhojyoti Mukherjee; Ruiyi Zhang; Zhengmian Hu; Bo Ni; Yuxiao Zhou; Zichao Wang; Yue Huang; Yu Wang; Xiangliang Zhang; Philip S. Yu; Xiyang Hu; Yue Zhao

TL;DR

PersonaConvBench introduces the first benchmark that jointly addresses personalized conversational reasoning and multi-turn structure across ten Reddit domains. It defines four core structures (message representation, a conversational graph, trajectories, and user history) and three tasks (classification, regression, and generation), evaluated in a temporal setting with robust metrics. Experiments with multiple LLMs show that leveraging personalized conversation history yields substantial gains across all tasks, demonstrating the value of structured user history and trajectory-aware prompts. The work provides data, code, and benchmarks to advance research on user-adaptive dialogue systems and real-world applications such as personalized customer support and adaptive virtual assistants.

Abstract

We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentence classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198 percent relative gain over the best non-conversational baseline in sentiment classification. By releasing PersonaConvBench with evaluations and code, we aim to support research on LLMs that adapt to individual styles, track long-term context, and produce contextually rich, engaging responses.
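
A note on reading the "198 percent relative gain" figure: assuming the usual definition of relative gain, (improved - baseline) / baseline, a 198 percent gain means the personalized score is roughly 2.98 times the baseline score, not 1.98 times. The sketch below illustrates only that arithmetic. The function name and the normalized scores are hypothetical and are not taken from the paper or its released code.

```python
def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, expressed in percent.

    Assumes the common definition (improved - baseline) / baseline * 100;
    the paper may define its metric differently.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical values: the baseline is normalized to 1.0 purely for illustration.
baseline_score = 1.0
personalized_score = 2.98

gain = relative_gain_percent(baseline_score, personalized_score)
print(f"relative gain: {gain:.0f} percent")
```

Under this definition, a gain of 100 percent corresponds to doubling the baseline, so 198 percent corresponds to just under tripling it.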
