HumanLM: Simulating Users with State Alignment Beats Response Imitation

Shirley Wu, Evelyn Choi, Arpandeep Khatua, Zhanghan Wang, Joy He-Yueya, Tharindu Cyril Weerasooriya, Wei Wei, Diyi Yang, Jure Leskovec, James Zou

TL;DR

HumanLM is a training framework for building user simulators that accurately reflect real users. It achieves the highest similarity to real user responses and competitive human-likeness scores.

Abstract

Large Language Models (LLMs) are increasingly used to simulate how specific users respond to a given context, enabling more user-centric applications that rely on user feedback. However, existing user simulators mostly imitate surface-level patterns and language styles, which fail to reflect the underlying states of real users (e.g., beliefs and emotions). To address these limitations, we propose a novel training framework, HumanLM, which builds user simulators that accurately reflect real users. Our key insight is that, in addition to generating responses, the model should generate natural-language latent states that align with ground-truth responses through reinforcement learning. These latent states correspond to a set of psychologically grounded state dimensions that drive how real users respond. HumanLM further synthesizes these aligned latent states into responses that accurately represent real users. For extensive evaluation, we develop Humanual, a comprehensive benchmark for simulating real users based on public data. Humanual consists of six large-scale datasets with 26k users and 216k responses in total, spanning diverse tasks such as generating user responses to daily life issues, political blogs, and chat sessions with LLM assistants. Across datasets, HumanLM significantly outperforms alternative approaches, achieving an average relative improvement of 16.3% in alignment scores from an LLM judge. In a real-time simulation study with 111 participants, HumanLM achieves the highest similarity to real user responses and competitive human-likeness scores.
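As a reading aid for the "average relative improvement" figure, the sketch below shows one common way such a number can be computed: take the relative gain over a baseline on each dataset, then average across datasets. This is an assumption about the aggregation, not a statement of the paper's exact procedure, and the function names and example scores are hypothetical placeholders rather than reported results.

```python
def relative_improvement(new_score: float, baseline_score: float) -> float:
    """Relative gain of new_score over baseline_score (0.10 means +10%)."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return (new_score - baseline_score) / baseline_score


def average_relative_improvement(score_pairs) -> float:
    """Mean of per-dataset relative gains.

    score_pairs: iterable of (new_score, baseline_score), one pair per dataset.
    Assumption: gains are averaged per dataset, not computed on pooled scores.
    """
    gains = [relative_improvement(new, base) for new, base in score_pairs]
    if not gains:
        raise ValueError("score_pairs must not be empty")
    return sum(gains) / len(gains)


# Hypothetical placeholder scores for two datasets (NOT results from the paper).
example_pairs = [(1.1, 1.0), (2.4, 2.0)]
example_gain = average_relative_improvement(example_pairs)
```

Averaging per-dataset gains weights every dataset equally, whereas computing a single gain on pooled scores would let datasets with larger raw scores dominate.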

Paper Structure (26 sections, 1 equation, 17 figures, 9 tables)

Figures (17)

  • Figure 1: HumanLM generates responses that capture the key points of real user responses. Given an input context (e.g., a news post) and a user profile, the model prioritizes alignment along a few psychologically grounded state dimensions (e.g., stance, emotion) that drive how users respond. For each state dimension, the model generates the corresponding latent state (e.g., "empathy toward victims"), which an LLM judge scores for consistency with the ground-truth response. During reinforcement learning, the model maximizes alignment scores on latent states to accurately reflect real users, in addition to directly improving the responses. When generating responses, the model produces reasoning traces containing the aligned latent states and synthesizes them into accurate responses.
  • Figure 2: Comparison between HumanLM and Supervised Fine-Tuning (SFT). Given a training dataset, SFT learns to capture the user's frequent use of emojis, resulting in an inaccurate response that misses the key points in the ground-truth response (cf. Figure 1) during evaluation. In contrast, HumanLM explicitly learns to align along different state dimensions, generating latent states that reflect the user in the reasoning trace, which leads to a more accurate response. We apply GRPO for reinforcement learning, where an LLM judge is prompted to compare a batch of generated latent states (a.k.a. rollouts) under each state dimension and score their alignment all at once, providing more precise rewards under fair comparisons.
  • Figure 3: Examples (context and ground truth) from Humanual, which covers six diverse domains: news comments, book reviews, opinions on daily life issues, political blogs, email replies, and follow-ups with LLM assistants.
  • Figure 4: State alignment scores ($\uparrow$) of HumanLM and two baselines on four Humanual datasets. Full results are in the appendix.
  • Figure 5: Training dynamics comparison between HumanLM and GRPO-think. Each dot represents a model checkpoint, saved every 25 steps when training on Humanual-Opinion. Each $x$ value is the checkpoint's alignment score along one of the state dimensions: belief, value, and stance. Each $y$ value is the response alignment score. Compared to GRPO-think, HumanLM covers a broader range of scores by exploring states with explicit alignment, which encourages better alignment on responses. Full results are in the appendix.
  • ...and 12 more figures
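
The Figure 2 caption describes a judge scoring a batch of rollouts per state dimension under GRPO. The sketch below illustrates the group-relative idea in its generic form: scores within one group of rollouts are standardized against that group's mean and standard deviation. It is a minimal sketch under stated assumptions, not the paper's implementation. The function names, the dictionary layout, and the example scores are hypothetical, and the judge scores are assumed to be already available as floats.

```python
from statistics import mean, pstdev


def group_relative_advantages(scores, eps=1e-6):
    """Standardize judge scores within one group of rollouts.

    Each rollout's advantage is its score minus the group mean, divided by
    the group standard deviation (eps avoids division by zero).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


def advantages_by_dimension(scores_by_dimension):
    """Apply the group-relative step separately to each state dimension.

    scores_by_dimension: dict mapping a dimension name (e.g. "stance") to the
    judge scores of the rollouts generated for that dimension.
    Assumption: each dimension forms its own comparison group.
    """
    return {
        dim: group_relative_advantages(scores)
        for dim, scores in scores_by_dimension.items()
    }


# Hypothetical judge scores for three rollouts per dimension (placeholders).
example_scores = {"stance": [1.0, 2.0, 3.0], "emotion": [0.5, 0.5, 0.5]}
example_advantages = advantages_by_dimension(example_scores)
```

Because advantages are relative within a group, a rollout is rewarded only for being better aligned than its siblings for the same context and dimension; a group whose rollouts all receive the same score yields zero advantage for every rollout.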