PersonaGym: Evaluating Persona Agents and LLMs
Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik Narasimhan, Vishvak Murahari
TL;DR
PersonaGym introduces a dynamic evaluation framework for persona agents and PersonaScore, a human-aligned automatic metric grounded in decision theory. It couples environment selection, persona-specific task generation, and ensemble evaluator scoring to assess five dimensions of persona fidelity across $|\text{E}| = 150$ environments and $|\text{Q}| = 10{,}000$ questions for $200$ personas. Experiments across ten LLMs reveal that model size does not reliably predict persona capabilities and that state-of-the-art models can underperform on multidimensional persona tasks, with linguistic habits and role-playing resistance emerging as key challenges. Human studies show strong correlations with PersonaScore, supporting its validity for large-scale automated persona evaluation and underscoring the framework's potential to guide future persona-agent development. The work provides a practical benchmark and a rigorous, theory-grounded approach for evaluating personalized agents in diverse, real-world settings.
Abstract
Persona agents, which are LLM agents conditioned to act according to an assigned persona, enable contextually rich and user-aligned interactions across domains like education and healthcare. However, evaluating how faithfully these agents adhere to their personas remains a significant challenge, particularly in free-form settings that demand consistency across diverse, persona-relevant environments. We introduce PersonaGym, the first dynamic evaluation framework for persona agents, and PersonaScore, a human-aligned automatic metric grounded in decision theory that enables comprehensive large-scale evaluation. Our evaluation of 10 leading LLMs across 200 personas and 10,000 questions reveals significant advancement opportunities. For example, GPT-4.1 had exactly the same PersonaScore as LLaMA-3-8b despite being a more recent and advanced closed-source model. Importantly, increased model size and complexity do not necessarily enhance persona agent capabilities, underscoring the need for algorithmic and architectural innovation toward faithful, performant persona agents.
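As an illustration of how ensemble evaluator scores might be combined into a single number, the sketch below averages evaluator ratings per question, then per dimension, then across dimensions. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `persona_score`, the input layout, the placeholder dimension names, the numeric rating scale, and the use of unweighted means are all hypothetical.

```python
from statistics import mean


def persona_score(ratings):
    """Aggregate ensemble ratings into one score (hypothetical sketch).

    ratings maps a dimension name to a list of questions; each question
    is a list of numeric scores, one per evaluator model in the ensemble.
    Unweighted averaging at every level is an assumption.
    """
    per_dimension = {
        dim: mean([mean(evaluator_scores) for evaluator_scores in questions])
        for dim, questions in ratings.items()
    }
    overall = mean(list(per_dimension.values()))
    return overall, per_dimension


# Made-up ratings for illustration only: two placeholder dimensions,
# two evaluators per question.
example = {
    "dimension_a": [[4, 5], [3, 3]],
    "dimension_b": [[5, 5]],
}
overall, per_dim = persona_score(example)
print(overall, per_dim)
```

Keeping the per-dimension averages alongside the overall score makes it possible to see which dimension drags a model down, which matters when two models tie on the aggregate.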
