A Personalized Conversational Benchmark: Towards Simulating Personalized Conversations

Li Li, Peilin Cai, Ryan A. Rossi, Franck Dernoncourt, Branislav Kveton, Junda Wu, Tong Yu, Linxin Song, Tiankai Yang, Yuehan Qin, Nesreen K. Ahmed, Samyadeep Basu, Subhojyoti Mukherjee, Ruiyi Zhang, Zhengmian Hu, Bo Ni, Yuxiao Zhou, Zichao Wang, Yue Huang, Yu Wang, Xiangliang Zhang, Philip S. Yu, Xiyang Hu, Yue Zhao

TL;DR

PersonaConvBench introduces the first benchmark that jointly addresses personalized conversational reasoning and multi-turn structure across ten Reddit domains. It defines four core structures—message representation, a conversational graph, trajectories, and user history—and three tasks (classification, regression, generation) evaluated via a temporal setting with robust metrics. Experiments with multiple LLMs show that leveraging personalized conversation history yields substantial gains across all tasks, demonstrating the value of structured user history and trajectory-aware prompts. The work provides data, code, and benchmarks to advance research on user-adaptive dialogue systems and real-world applications such as personalized customer support and adaptive virtual assistants.
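The four structures named above can be pictured with a small sketch. This is only an illustration under stated assumptions: the class and function names (`Message`, `build_children`, `trajectories`, `user_history`) and the field names are hypothetical and are not taken from the released code; the fields follow the per-message annotations listed in the Figure 1 caption (username $v$, timestamp $t$, content $x$, score $s$), and a trajectory is assumed to be a root-to-leaf path in the reply tree.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Message:
    """One post or reply. Field names are illustrative, not the released schema."""
    msg_id: str
    parent_id: Optional[str]  # None for the post that starts a thread
    user: str                 # username v
    timestamp: int            # time t
    text: str                 # message content x
    score: int                # community feedback score s


def build_children(messages: List[Message]) -> Dict[Optional[str], List[Message]]:
    """Conversational graph as an adjacency map: parent id -> replies, oldest first."""
    children: Dict[Optional[str], List[Message]] = {}
    for m in sorted(messages, key=lambda m: m.timestamp):
        children.setdefault(m.parent_id, []).append(m)
    return children


def trajectories(messages: List[Message]) -> List[List[Message]]:
    """Every root-to-leaf path through the reply tree (assumed trajectory definition)."""
    children = build_children(messages)
    paths: List[List[Message]] = []

    def walk(node: Message, prefix: List[Message]) -> None:
        path = prefix + [node]
        replies = children.get(node.msg_id, [])
        if not replies:
            paths.append(path)
        for reply in replies:
            walk(reply, path)

    for root in children.get(None, []):
        walk(root, [])
    return paths


def user_history(messages: List[Message], user: str, before: int) -> List[Message]:
    """Messages written by `user` strictly before time `before`, oldest first."""
    earlier = [m for m in messages if m.user == user and m.timestamp < before]
    return sorted(earlier, key=lambda m: m.timestamp)


# Toy thread: user "v" posts, two users reply, and "v" answers one of them.
thread = [
    Message("m1", None, "v", 1, "Original post", 10),
    Message("m2", "m1", "u1", 2, "First reply", 3),
    Message("m3", "m1", "u2", 3, "Second reply", 1),
    Message("m4", "m2", "v", 4, "Author answers u1", 5),
]
paths = [[m.msg_id for m in path] for path in trajectories(thread)]
```

In this toy thread the reply `m4` by user `v` would be a prediction target, with `m1` and `m2` as the earlier part of its trajectory.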

Abstract

We present PersonaConvBench, a large-scale benchmark for evaluating personalized reasoning and generation in multi-turn conversations with large language models (LLMs). Unlike existing work that focuses on either personalization or conversational structure in isolation, PersonaConvBench integrates both, offering three core tasks: sentiment classification, impact regression, and user-centric text generation across ten diverse Reddit-based domains. This design enables systematic analysis of how personalized conversational context shapes LLM outputs in realistic multi-user scenarios. We benchmark several commercial and open-source LLMs under a unified prompting setup and observe that incorporating personalized history yields substantial performance improvements, including a 198% relative gain over the best non-conversational baseline in sentiment classification. By releasing PersonaConvBench with evaluations and code, we aim to support research on LLMs that adapt to individual styles, track long-term context, and produce contextually rich, engaging responses.
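To read the relative-gain figure quoted in the abstract, it may help to write out the usual definition. The abstract does not state the formula, so the definition below is an assumption, and the symbols $m_{\text{pers}}$ (score with personalized conversational history) and $m_{\text{base}}$ (score of the best non-conversational baseline) are introduced here only for illustration.

```latex
\[
\text{relative gain}
  = \frac{m_{\text{pers}} - m_{\text{base}}}{m_{\text{base}}} \times 100\%,
\qquad
\text{relative gain} = 198\%
  \;\Longleftrightarrow\;
  m_{\text{pers}} = 2.98\, m_{\text{base}} .
\]
```

Under that definition, a 198% relative gain means the personalized score is roughly three times the baseline score, not roughly twice.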

Paper Structure

This paper contains 51 sections, 6 equations, 10 figures, 20 tables.

Figures (10)

  • Figure 1: Illustration of the personalized conversational setting in PersonaConvBench. The center black node represents a user $v$ who initiates a post, leading to multiple conversational trajectories as other users respond. Over time, user $v$ replies to some of these users, forming deeper branches in the graph. For each reply made by user $v$, we can construct a prediction task—either classifying the response’s sentiment, forecasting its community score, or generating the response text—based on the earlier parts of the same trajectory and additional user-specific history. These tasks rely on realistic multi-user, multi-turn settings with graph-structured conversational data. Each message is annotated with username $v$, timestamp $t$, message content $x$, and feedback score $s$, supporting fine-grained personalization across all task types.
  • Figure 2: Performance of GPT-4.1 on our personalized conversation benchmark. Incorporating personalized conversational context significantly improves model performance across all tasks and evaluation metrics. Notably, the P-Conv variant consistently outperforms the NP-Conv and P-NonConv baselines in classification, regression, and text generation metrics. Note: RMSE and MAE are normalized to $[0, 1]$ (higher is better), using the formulas: $\text{RMSE}_{\text{scaled}} = \frac{360 - \text{RMSE}}{70}$ and $\text{MAE}_{\text{scaled}} = \frac{120 - \text{MAE}}{30}$. For radar charts showing the performance of other models, see Figs. \ref{fig:radar_claude35}-\ref{fig:radar_deepseekr1} in the Appendix.
  • Figure A: In-context prompt construction for personalized conversational inference. Given a held-out user trajectory set, we sample a prefix from the current test thread and draw a demonstration example from a different random thread by the same user. The prefix and demonstration, along with user history outside the test thread, are composed into a single prompt for in-context learning. This unified formulation supports three tasks—sentiment classification, impact forecasting, and next-text generation—by conditioning the LLM on personalized multi-turn context without future leakage.
  • Figure B: Illustration of the personalized conversational setting in PersonaConvBench. The center black node represents a user $v$ who initiates a post, leading to multiple conversational trajectories as other users respond. Over time, user $v$ replies to some of these users, forming deeper branches in the graph. For each reply made by user $v$, we can construct a prediction task—either classifying the response’s sentiment, forecasting its community score, or generating the response text—based on the earlier parts of the same trajectory and additional user-specific history. These tasks rely on realistic multi-user, multi-turn settings with graph-structured conversational data. Each message is annotated with username $v$, timestamp $t$, message content $x$, and feedback score $s$, supporting fine-grained personalization across all task types.
  • Figure C: Case study of the Personalized Conversational Follow-up Text Generation task. The model is asked to generate a masked reply based on the full conversational context and a user-specific example. Output is evaluated using both lexical (e.g., ROUGE, BLEU) and semantic (e.g., SBERT) metrics.
  • ...and 5 more figures
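The error rescaling given in the Figure 2 caption can be checked with a few lines of code. The function names `scale_rmse` and `scale_mae` are hypothetical; only the constants come from the caption. One consequence of the formulas is that the scaled value falls in $[0, 1]$ only when RMSE lies in $[290, 360]$ and MAE lies in $[90, 120]$. The caption does not say how values outside those ranges are handled, so this sketch leaves them unclipped.

```python
def scale_rmse(rmse: float) -> float:
    """Rescaled RMSE from the Figure 2 caption: (360 - RMSE) / 70, higher is better."""
    return (360.0 - rmse) / 70.0


def scale_mae(mae: float) -> float:
    """Rescaled MAE from the Figure 2 caption: (120 - MAE) / 30, higher is better."""
    return (120.0 - mae) / 30.0


# Endpoints and midpoints of the ranges that map onto [0, 1].
rmse_checks = [scale_rmse(360.0), scale_rmse(325.0), scale_rmse(290.0)]
mae_checks = [scale_mae(120.0), scale_mae(105.0), scale_mae(90.0)]
```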
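The prompt construction described in the Figure A caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Turn` type, the `build_prompt` signature, the section headings, and the task wording are all hypothetical. It also assumes that each thread is stored as a chronologically ordered list of turns, and that "no future leakage" means keeping only turns before the target in the test thread and only history messages with an earlier timestamp.

```python
import random
from typing import Dict, List, NamedTuple


class Turn(NamedTuple):
    """Minimal, hypothetical message form: who wrote it, when, and what."""
    user: str
    time: int
    text: str


def render(turns: List[Turn]) -> str:
    return "\n".join(f"{t.user}: {t.text}" for t in turns)


def build_prompt(threads: Dict[str, List[Turn]], test_id: str,
                 target_idx: int, rng: random.Random) -> str:
    test_thread = threads[test_id]
    target = test_thread[target_idx]      # the reply to be predicted
    prefix = test_thread[:target_idx]     # earlier turns only, target excluded

    # Other threads in which the same user took part.
    others = sorted(tid for tid, turns in threads.items()
                    if tid != test_id and any(t.user == target.user for t in turns))
    if not others:
        raise ValueError("user has no other thread to draw a demonstration from")
    demo = threads[rng.choice(others)]    # demonstration from a different random thread

    # User history outside the test thread, restricted to earlier messages.
    history = [t for tid in others for t in threads[tid]
               if t.user == target.user and t.time < target.time]

    return "\n\n".join([
        "### User history\n" + render(history),
        "### Demonstration thread\n" + render(demo),
        "### Current thread\n" + render(prefix),
        f"### Task\nWrite the next reply from {target.user}.",
    ])
```

The final task line is written for text generation; for sentiment classification or impact forecasting, only that instruction would change while the rest of the prompt stays the same.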