Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning
Marwa Abdulhai, Ryan Cheng, Donovan Clay, Tim Althoff, Sergey Levine, Natasha Jaques
TL;DR
This paper addresses the problem that LLM-based human simulators frequently drift from their prescribed personas in multi-turn dialogues, which can undermine downstream training and evaluation. It introduces a unified framework with three automatic consistency metrics (Prompt-to-Line, Line-to-Line, and Q&A Consistency), validated against human judgments, and uses them as rewards in multi-turn PPO fine-tuning to produce more faithful simulated users across three domains (open-ended conversation, education, and mental health). Empirical results show that inconsistency is reduced by over 55% and that PPO-based fine-tuning outperforms supervised and offline RL baselines, with robust performance across model sizes and longer dialogues. The framework supports scalable, persona-grounded evaluation and optimization of LLM-based user simulators, enabling more reliable training and evaluation pipelines for downstream AI agents. The authors acknowledge limitations around static personas, long-horizon dynamics, and ethical risks in simulated human interactions.
Abstract
Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. While these simulations enable scalable training and evaluation of AI agents, off-the-shelf LLMs often drift from their assigned personas, contradict earlier statements, or abandon role-appropriate behavior. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics (prompt-to-line consistency, line-to-line consistency, and Q&A consistency) that capture different types of persona drift, and we validate each against human annotations. Using these metrics as reward signals, we apply multi-turn reinforcement learning to fine-tune LLMs for three user roles: a patient, a student, and a social chat partner. Our method reduces inconsistency by over 55%, resulting in more coherent and faithful simulated users.
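
To make the idea of a consistency metric concrete, the following is a minimal sketch of how two of the three metrics might be scored over a dialogue. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the scoring rule, so the per-line fraction used here, the function names, and the `Judge` interface are all hypothetical, and `toy_judge` is a keyword stand-in for what would in practice be an LLM-based judge. Q&A consistency is omitted because it would require a question-generation step that the text above does not describe.

```python
from typing import Callable, List

# Hypothetical judge interface: returns True when `line` is consistent
# with `reference`. In practice this would likely be an LLM call.
Judge = Callable[[str, str], bool]


def prompt_to_line(persona: str, lines: List[str], judge: Judge) -> float:
    """Fraction of simulated-user lines judged consistent with the persona prompt."""
    if not lines:
        return 1.0
    return sum(judge(persona, line) for line in lines) / len(lines)


def line_to_line(lines: List[str], judge: Judge) -> float:
    """Fraction of lines (after the first) consistent with every earlier line."""
    if len(lines) < 2:
        return 1.0
    consistent = 0
    for i in range(1, len(lines)):
        # A line counts as consistent only if it contradicts no earlier line.
        if all(judge(prev, lines[i]) for prev in lines[:i]):
            consistent += 1
    return consistent / (len(lines) - 1)


def toy_judge(reference: str, line: str) -> bool:
    """Toy stand-in (assumption): flags 'I hate' against an 'I love' reference."""
    return not ("I love" in reference and "I hate" in line)


persona = "I love hiking."
lines = ["I love hiking on weekends.", "I hate hiking."]
p2l = prompt_to_line(persona, lines, toy_judge)
l2l = line_to_line(lines, toy_judge)
```

Scores of this kind lie in [0, 1], so they could plausibly be used directly, or in some weighted combination, as a scalar reward for reinforcement learning; how the paper actually combines them is not stated in the abstract.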
