Mind the Sim2Real Gap in User Simulation for Agentic Tasks

Xuhui Zhou, Weiwei Sun, Qianou Ma, Yiqing Xie, Jiarui Liu, Weihua Du, Sean Welleck, Yiming Yang, Graham Neubig, Sherry Tongshuang Wu, Maarten Sap

TL;DR

This paper formalizes the Sim2Real gap in user simulation, presents the first study running the full $\tau$-bench protocol with real humans, and introduces the User-Sim Index (USI), a metric that quantifies how closely LLM simulators resemble real users' interactive behaviors and feedback.

Abstract

As NLP evaluation shifts from static benchmarks to multi-turn interactive settings, LLM-based simulators have become widely used as user proxies, serving two roles: generating user turns and providing evaluation signals. Yet these simulations are frequently assumed to be faithful to real human behavior, often without rigorous verification. We formalize the Sim2Real gap in user simulation and present the first study running the full $\tau$-bench protocol with real humans (451 participants, 165 tasks), benchmarking 31 LLM simulators across proprietary, open-source, and specialized families using the User-Sim Index (USI), a metric we introduce to quantify how well LLM simulators resemble real user interactive behaviors and feedback. Behaviorally, LLM simulators are excessively cooperative, stylistically uniform, and lack realistic frustration or ambiguity, creating an "easy mode" that inflates agent success rates above the human baseline. In evaluations, real humans provide nuanced judgments across eight quality dimensions, while simulated users produce uniformly more positive feedback; rule-based rewards fail to capture the rich feedback signals generated by human users. Overall, higher general model capability does not necessarily yield more faithful user simulation. These findings highlight the importance of human validation when using LLM-based user simulators in the agent development cycle and motivate improved models for user simulation.

Paper Structure (46 sections, 3 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: User-Sim Index (USI) vs. Chatbot Arena Elo Score for LLM simulators. Solid lines and shaded regions show per-family linear regression with 80% confidence bands; error bars denote standard deviation across three annotator batches. Outside the GPT series, general capability does not reliably translate into faithful user simulation.
  • Figure 2: Taxonomy of Sim2Real gaps in user simulation. We highlight the dimensions where the gaps between humans and all LLMs are significant. See Appendix §\ref{app:behavioral_metric_definitions} for exact operational definitions of the behavioral metrics.
  • Figure 3: Per-metric behavioral comparison for selected models (GPT-4o, Qwen3-235B, CoSER, UserLM-8b) and human users on $\tau$-bench tasks. Metrics are grouped into four dimensions: Communication Styles (D1), Clarification (D3), Information Pattern (D2), and Error Reaction (D4). Human values appear as dark bars. Red-outlined metrics indicate large divergence from human behavior. Full results for all models are in Table \ref{tab:behavioral_divergence}.
  • Figure 4: Score distributions for human annotators and GPT-5.1 across quality dimensions ($n{=}165{\times}3$ batches), with mean differences ($\Delta$ = LLM $-$ Human). The LLM evaluator is lenient on interaction quality but conservative on task success.
  • Figure 5: Per-dimension human quality ratings by reward group ($n{=}495$). Stacked distributions for reward=0 and reward=1 are nearly indistinguishable across all eight dimensions, confirming that the binary reward captures none of these quality aspects.
  • ...and 3 more figures