Flipping the Dialogue: Training and Evaluating User Language Models

Tarek Naous, Philippe Laban, Wei Xu, Jennifer Neville

TL;DR

This work demonstrates that assistant-trained simulators poorly reflect real user behavior in multi-turn conversations, motivating purpose-built User Language Models (User LMs) trained to simulate users with high-level intents. By flipping real human–assistant dialogues and conditioning on intent and conversation state, the authors train User LMs that initiate, refine, and terminate interactions, achieving superior distributional alignment and robustness compared to baselines. Intrinsic evaluations show User LMs produce more diverse, intent-decomposed, and endable dialogues, closely mirroring human behavior; scaling improves performance, while stronger assistant LMs do not necessarily improve user simulation. Extrinsic experiments reveal that using User LMs to simulate coding and math conversations yields a more realistic assessment of assistant performance, with GPT-4o's success rate dropping when faced with human-like user behavior. The authors release UserLM-8b to spur open-source research and discuss implications for personalized user simulations and safer, more robust assistant development.

Abstract

Conversations with LMs involve two participants: a human user leading the conversation, and an LM assistant responding to the user's request. To satisfy this specific role, LMs are post-trained to be helpful assistants, optimized to produce exhaustive and well-structured responses, free of ambiguity and grammar errors. User utterances, on the other hand, are rarely perfected, with each user phrasing requests in unique ways, sometimes putting in partial effort at each turn and refining on the fly. To evaluate LM performance in realistic settings, prior work simulated users in multi-turn conversations, often prompting an LM originally trained to be a helpful assistant to act as a user. However, we show that assistant LMs make for poor user simulators, with the surprising finding that better assistants yield worse simulators. Instead, we introduce purpose-built User Language Models (User LMs): models post-trained to simulate human users in multi-turn conversations. Through various evaluations, we show how User LMs align better with human behavior and achieve better simulation robustness than existing simulation methods. When leveraging User LMs to simulate coding and math conversations, the performance of a strong assistant (GPT-4o) drops from 74.6% to 57.4%, confirming that more realistic simulation environments cause assistants to struggle, as they fail to cope with the nuances of users in multi-turn setups.

Paper Structure

This paper contains 72 sections, 9 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Comparison of simulating users in conversations by prompting an assistant LM (GPT-4o) to roleplay a user vs. our user language model UserLM-8b. Both simulators converse with an assistant (GPT-4o) to solve a coding problem. The GPT-4o-based simulator produces simple and direct user turns, enabling the assistant to successfully solve the task. In contrast, UserLM-8b reveals the intent in a correct but paraphrased form, leading the assistant to fail on the task. UserLM-8b is more aligned with the behavior of real users, helping better estimate the performance of assistants in realistic, multi-turn conversations.
  • Figure 2: A diagram illustrating our approach to train a UserLM. We leverage in-the-wild human-assistant conversations, generating high-level user intents for each conversation. We then flip the dialogue, turning each conversation with K turns into K+1 training samples, conditioning both on the high-level intent and conversation state to generate the next user utterance.
  • Figure 3: Comparison of different training setups for our user LMs: (a) Effect of training with conditioning on the generic intent; (b) Effect of starting from the base vs. instruction-tuned checkpoints.
  • Figure 4: Per-turn token-level PPL achieved by models on PRISM utterances. All models are conditioned on the generic user intent of each conversation. Our user LMs outperform all baselines and achieve much lower PPL, especially at the first turn.
  • Figure 5: Cumulative n-gram overlap between generated user turns and the generic intent of each conversation. Results are averaged across each turn for all conversations in PRISM. Our user LMs achieve the lowest cumulative n-gram overlap with the intent, aligning with real human utterances.
  • ...and 5 more figures
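
The dialogue-flipping step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the extra (K+1)-th sample teaches the model to end the conversation, which is consistent with the TL;DR's statement that User LMs learn to terminate interactions. The names `Turn`, `flip_dialogue`, and the `END_OF_CONVERSATION` marker are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical end marker; the actual token used by the authors is not
# given in this text.
END_OF_CONVERSATION = "<end_of_conversation>"


@dataclass
class Turn:
    """One exchange: a user utterance followed by the assistant reply."""
    user: str
    assistant: str


def flip_dialogue(intent: str, turns: list[Turn]) -> list[dict]:
    """Turn a K-turn conversation into K+1 user-side training samples.

    Each sample conditions on the high-level intent and the conversation
    state so far, and targets the next user utterance. The final sample
    (an assumption of this sketch) targets the end marker.
    """
    samples = []
    history: list[tuple[str, str]] = []
    for turn in turns:
        samples.append(
            {"intent": intent, "history": list(history), "target": turn.user}
        )
        history.append(("user", turn.user))
        history.append(("assistant", turn.assistant))
    # Sample K+1: after the last assistant reply, the user ends the dialogue.
    samples.append(
        {"intent": intent, "history": list(history), "target": END_OF_CONVERSATION}
    )
    return samples


example = flip_dialogue(
    "Get a function that reverses a string",
    [
        Turn("how do i reverse a string", "Which language are you using?"),
        Turn("python", "Use slicing: s[::-1]"),
    ],
)
```

Here a two-turn conversation produces three samples: the first has an empty history (the user initiates), the second conditions on the first exchange, and the third targets the end marker.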