Flipping the Dialogue: Training and Evaluating User Language Models
Tarek Naous, Philippe Laban, Wei Xu, Jennifer Neville
TL;DR
This work demonstrates that assistant-trained simulators poorly reflect real user behavior in multi-turn conversations, motivating purpose-built User Language Models (User LMs) trained to simulate users with high-level intents. By flipping real human–assistant dialogues and conditioning on intent and conversation state, the authors train User LMs that initiate, refine, and terminate interactions, achieving superior distributional alignment and robustness compared to baselines. Intrinsic evaluations show User LMs produce more diverse, intent-decomposed, and endable dialogues, closely mirroring human behavior; scaling improves performance, while stronger assistant LMs do not necessarily improve user simulation. Extrinsic experiments reveal that using User LMs to simulate coding and math conversations yields a more realistic assessment of assistant performance, with GPT-4o's success rate dropping when faced with human-like user behavior. The authors release UserLM-8b to spur open-source research and discuss implications for personalized user simulations and safer, more robust assistant development.
Abstract
Conversations with LMs involve two participants: a human user leading the conversation, and an LM assistant responding to the user's request. To satisfy this specific role, LMs are post-trained to be helpful assistants -- optimized to produce exhaustive and well-structured responses, free of ambiguity and grammar errors. User utterances, on the other hand, are rarely perfected, with each user phrasing requests in unique ways, sometimes putting in partial effort at each turn and refining on the fly. To evaluate LM performance in realistic settings, prior work simulated users in multi-turn conversations, often prompting an LLM originally trained to be a helpful assistant to act as a user. However, we show that assistant LMs make for poor user simulators, with the surprising finding that better assistants yield worse simulators. Instead, we introduce purpose-built User Language Models (User LMs) - models post-trained to simulate human users in multi-turn conversations. Through various evaluations, we show how User LMs align better with human behavior and achieve better simulation robustness than existing simulation methods. When leveraging User LMs to simulate coding and math conversations, the performance of a strong assistant (GPT-4o) drops from 74.6% to 57.4%, confirming that more realistic simulation environments lead to assistant struggles as they fail to cope with the nuances of users in multi-turn setups.
