Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning

Marwa Abdulhai, Ryan Cheng, Donovan Clay, Tim Althoff, Sergey Levine, Natasha Jaques

TL;DR

This paper tackles the problem that LLM-based human simulators frequently drift from their prescribed personas in multi-turn dialogues, which can undermine downstream training and evaluation. It introduces a unified framework with three automatic consistency metrics (Prompt-to-Line, Line-to-Line, and Q&A Consistency), validated against human judgments, and uses them as rewards in multi-turn PPO fine-tuning to produce more faithful simulated users across three domains (open-ended conversation, education, mental health). Empirical results show that inconsistency is reduced by over 55% and that PPO-based fine-tuning outperforms supervised and offline RL baselines, with robust performance across different model sizes and longer dialogues. The framework supports scalable, persona-grounded evaluation and optimization of LLM-based user simulators, enabling more reliable training and evaluation pipelines for downstream AI agents. The authors acknowledge limitations around static personas, long-horizon dynamics, and ethical risks in simulated human interactions.

Abstract

Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. While these simulations enable scalable training and evaluation of AI agents, off-the-shelf LLMs often drift from their assigned personas, contradict earlier statements, or abandon role-appropriate behavior. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics (prompt-to-line consistency, line-to-line consistency, and Q&A consistency) that capture different types of persona drift, and we validate each against human annotations. Using these metrics as reward signals, we apply multi-turn reinforcement learning to fine-tune LLMs for three user roles: a patient, a student, and a social chat partner. Our method reduces inconsistency by over 55%, resulting in more coherent and faithful simulated users.

Paper Structure

This paper contains 53 sections, 3 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: We begin by generating dialogue with open-source instruction-tuned models conditioned on user persona/strategy prompts. We then evaluate the generated conversations using three metrics: prompt-to-line consistency, which checks alignment with the initial persona; line-to-line consistency, which detects contradictions within a conversation; and Q&A consistency, which probes for stable beliefs and strategy over time. Finally, we perform multi-turn RL fine-tuning with these metrics to achieve greater consistency in dialogue.
  • Figure 2: Examples of inconsistencies detected by our evaluation metrics. Each panel highlights a different form of consistency failure across tasks. Left: Prompt-to-line inconsistency in the open-ended conversation task, where the agent contradicts its persona background. Middle: Line-to-line inconsistency in the education task, where the student gives conflicting responses within the same conversation. Right: Q&A consistency failure in the mental health task, where the agent’s self-reported feelings conflict with its stated beliefs.
  • Figure 3: Pairwise consistency agreement across metrics and tasks. Each heatmap shows the fraction of utterances where two consistency metrics agree in their classification (consistent vs. inconsistent) averaged across models. We observe strong alignment between prompt-to-line and line-to-line consistency but weaker agreement with Q&A consistency, indicating surface-level coherence without stable long-term beliefs. We also observe task-specific trends, such as stronger alignment in Education dialogues and more conflicting patterns in Mental Health, demonstrating the importance of using complementary metrics to evaluate consistency.
  • Figure 4: Prompt Consistency Across Fine-Tuning Methods. We compare the prompt-to-line consistency metric for four methods: the baseline Llama-8B-instruct model, supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Proximal Policy Optimization (PPO, ours). Results cover the open-ended conversation, education, and mental health tasks (mean/std shown). PPO achieves the highest consistency in all tasks, with particularly strong gains in education and mental health.
  • Figure 5: Conversation length vs. consistency across three metrics. Each subplot shows the mean score (with error bars) for Llama-3.1-8B-Instruct, gemma-2-2b-it and mistral-instruct at varying conversation lengths: (a) prompt consistency, (b) line-to-line consistency, (c) Q&A consistency.
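
The pairwise agreement described in the Figure 3 caption (the fraction of utterances on which two metrics give the same consistent/inconsistent label) can be illustrated with a minimal sketch. This is not the authors' code: the function name `pairwise_agreement`, the metric keys, and the toy labels are all hypothetical, and the sketch covers a single set of utterances rather than the average across models that the figure reports.

```python
from itertools import combinations


def pairwise_agreement(labels):
    """Fraction of utterances on which each pair of metrics agrees.

    `labels` maps a metric name to a list of booleans, one per utterance
    (True = judged consistent). Names and data layout are assumptions
    made for this illustration only.
    """
    names = list(labels)
    if not names:
        raise ValueError("at least one metric is required")
    n = len(labels[names[0]])
    if n == 0:
        raise ValueError("at least one utterance is required")
    if any(len(v) != n for v in labels.values()):
        raise ValueError("all metrics must label the same utterances")

    result = {}
    for a, b in combinations(names, 2):
        # Count utterances where both metrics give the same label.
        matches = sum(x == y for x, y in zip(labels[a], labels[b]))
        result[(a, b)] = matches / n
    return result


# Made-up labels for four utterances (not data from the paper).
toy_labels = {
    "prompt_to_line": [True, True, False, True],
    "line_to_line": [True, True, False, False],
    "qa": [False, True, True, False],
}
agreement = pairwise_agreement(toy_labels)
for pair, frac in agreement.items():
    print(pair, frac)
```

Each value is one cell of a single heatmap. Averaging such values over the models being compared would give something analogous to what the caption describes.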