PersonaGym: Evaluating Persona Agents and LLMs

Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik Narasimhan, Vishvak Murahari

TL;DR

PersonaGym introduces a dynamic evaluation framework for persona agents and PersonaScore, a human-aligned automatic metric grounded in decision theory. It couples environment selection, persona-specific task generation, and ensemble evaluator scoring to assess five dimensions of persona fidelity across $|\text{E}| = 150$ environments and $|\text{Q}| = 10{,}000$ questions for $200$ personas. Experiments across ten LLMs reveal that model size does not reliably predict persona capabilities and that state-of-the-art models can underperform on multidimensional persona tasks, with linguistic habits and role-playing resistance emerging as key challenges. Human studies show strong correlations with PersonaScore, supporting its validity for large-scale automated persona evaluation and underscoring the framework's potential to guide future persona-agent development. The work provides a practical benchmark and a rigorous, theory-grounded approach for evaluating personalized agents in diverse, real-world settings.

Abstract

Persona agents, which are LLM agents conditioned to act according to an assigned persona, enable contextually rich and user-aligned interactions across domains like education and healthcare. However, evaluating how faithfully these agents adhere to their personas remains a significant challenge, particularly in free-form settings that demand consistency across diverse, persona-relevant environments. We introduce PersonaGym, the first dynamic evaluation framework for persona agents, and PersonaScore, a human-aligned automatic metric grounded in decision theory that enables comprehensive large-scale evaluation. Our evaluation of 10 leading LLMs across 200 personas and 10,000 questions reveals significant advancement opportunities. For example, GPT-4.1 had the exact same PersonaScore as LLaMA-3-8b despite being a more recent and advanced closed-source model. Importantly, increased model size and complexity do not necessarily enhance persona agent capabilities, underscoring the need for algorithmic and architectural innovation toward faithful, performant persona agents.

Paper Structure (45 sections, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Comparison of responses between a general LLM (Left: no assigned persona) and a persona-assigned LLM (Right: "a cowboy"). Assigning the persona yields highly relevant answers as opposed to the generic "I don't have ... preferences".
  • Figure 2: In PersonaGym, relevant environments are selected from a pool of 150 diverse options using an LLM reasoner based on persona descriptions. The persona agent is then initialized in these environments and responds to probing questions across five evaluation tasks. Final PersonaScore is determined by two strong LLM evaluators.
  • Figure 3: (Top) Distribution of static environments in PersonaGym, illustrating the diversity of the pool from which relevant environments are selected for a given persona. (Bottom) Distribution of attributes in the personas used in experimentation. (Full-size versions appear in the Appendix; examples of complete persona descriptions are provided in Appendix D.)
  • Figure 4: The number of refusals by each LLM when given role-play requests. Claude 3 Haiku is strongly opposed to role-play instructions.
  • Figure 5: Cross-evaluation experiment comparing performance across different question generator and evaluator model combinations for the same sample of 25 personas and environments.
  • ...and 2 more figures
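
The Figure 2 caption says the final PersonaScore is determined by two LLM evaluators over five evaluation tasks. The sketch below illustrates one plausible way such ratings could be aggregated. It is not the authors' implementation: the plain averaging, the numeric rating scale, the function name `aggregate_persona_score`, and the task labels are all assumptions made for illustration.

```python
from statistics import mean


def aggregate_persona_score(ratings):
    """Aggregate evaluator ratings into per-task scores and an overall score.

    `ratings` maps a task name to a list of questions, where each question
    is a list holding one numeric rating per evaluator.
    ASSUMPTION: plain averaging at every level; the paper may do otherwise.
    """
    per_task = {}
    for task, per_question in ratings.items():
        # Average the evaluators for each question, then average over questions.
        per_task[task] = mean(mean(scores) for scores in per_question)
    # The overall score is the unweighted mean of the per-task scores.
    overall = mean(per_task.values())
    return overall, per_task


# Hypothetical ratings: two placeholder tasks, two questions each,
# two evaluators per question.
example_ratings = {
    "task_a": [[4, 5], [3, 3]],
    "task_b": [[2, 4], [5, 5]],
}
overall, per_task = aggregate_persona_score(example_ratings)
print(per_task)
print(overall)
```

Averaging the evaluators first and the questions second keeps every question equally weighted within its task, and averaging the tasks last keeps a task with many questions from dominating the overall score.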