Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data

Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Zheshen Wang, Qi He, Dakuo Wang

TL;DR

The paper tackles the problem of accurately simulating multi-turn human shopping behavior at the action level. It introduces a process-centric next-action task augmented with synthesized reasoning traces, and builds a large real-world dataset to benchmark prompt-based versus fine-tuned LLM agents. The key contribution is showing that out-of-the-box models underperform significantly, while fine-tuning with real click-through data and reasoning traces yields substantial gains in both action generation and final outcome prediction. This work provides a rigorous benchmark and actionable guidance for developing more faithful LLM agents in interactive domains.

Abstract

Recent research shows that LLM agents can generate "believable" human behaviors via prompt-only methods, and such agents have been increasingly adopted in downstream applications. However, existing evaluations of these agents focus only on qualitative believability (whether human raters think they are accurate), leaving open the question of whether LLM agents can accurately generate step-by-step actions mimicking a particular human's behavior in a multi-turn interaction task. In this work, we take shopping as a case study and present the first large-scale quantitative evaluation of state-of-the-art LLMs' ability to accurately simulate human behavior. Using real-world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt-based LLMs (DeepSeek-R1, Llama, Claude) achieve only 11.86% accuracy in generating human actions, highlighting a substantial gap in actual behavioral accuracy. Through experiments, we also show that strategies as simple as fine-tuning LLMs on real human click-through data augmented with synthesized reasoning traces can greatly enhance models' performance. The fine-tuned Qwen2.5-7B achieves 17.26% action generation accuracy and 33.86% F1 score on final purchase prediction, representing substantial improvements of 5.4% and 13.85% over prompt-only baselines. This work establishes the first rigorous benchmark for human behavior simulation and provides actionable insights for developing more accurate LLM agents for future downstream applications.

Paper Structure

This paper contains 28 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the next action prediction task. The model takes the currently observed $\langle$context$\rangle_{t}$ and a sequence of previous $\langle$context, reasoning, action$\rangle_{1:t-1}$ as input, and generates the next $\langle$reasoning, action$\rangle_{t}$ as output. Because the real-world human behavior dataset does not have ground-truth reasoning, we generate a synthesized reasoning trace to complement each $\langle$context, action$\rangle$ pair.
  • Figure 2: Action categories in the human ground truth and in the outputs generated by prompt-based Claude and by our fine-tuned models.
  • Figure 3: Error Type Analysis of different models.
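
The input/output structure described in the Figure 1 caption can be sketched as follows. This is a minimal illustration only: the `Step` class, the `build_prompt` function, and the bracketed text layout are hypothetical names and formats chosen for this example, not the serialization used in the paper, which this page does not show.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One past turn: <context, reasoning, action> (hypothetical container)."""
    context: str    # what the user observed at this turn
    reasoning: str  # synthesized reasoning trace, not human ground truth
    action: str     # the recorded human action, serialized as text


def build_prompt(history: List[Step], current_context: str) -> str:
    """Join <context, reasoning, action>_{1:t-1} and <context>_t into one input.

    The model is then expected to continue with <reasoning, action>_t.
    The line format below is an assumption made for illustration.
    """
    lines = []
    for i, step in enumerate(history, start=1):
        lines.append(f"[step {i}] context: {step.context}")
        lines.append(f"[step {i}] reasoning: {step.reasoning}")
        lines.append(f"[step {i}] action: {step.action}")
    t = len(history) + 1
    lines.append(f"[step {t}] context: {current_context}")
    lines.append(f"[step {t}] reasoning:")  # generation starts here
    return "\n".join(lines)


if __name__ == "__main__":
    # Toy session with made-up page descriptions and actions.
    history = [
        Step(
            context="home page with search box",
            reasoning="I want to look for running shoes.",
            action="search: running shoes",
        )
    ]
    print(build_prompt(history, "search results listing 20 products"))
```

Placing the open `reasoning:` slot last reflects the caption's output order, in which the model produces the reasoning before the action for step $t$.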