Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping

Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, Dakuo Wang

TL;DR

The paper tackles the challenge of simulating personalized, step-wise user behavior in online shopping by introducing Customer-R1, an RL-based LLM agent conditioned on explicit user personas. It defines a next-action generation task with rationale and uses a verifiable reward combining action correctness and formatting, optimized via Group Relative Policy Optimization (GRPO). Empirical results on the OPeRA dataset show that SFT followed by RL (initialized from SFT) yields the best performance, significantly surpassing prompting and SFT baselines and achieving action distributions that better reflect user-specific personas. Ablation studies demonstrate that both persona information and rationale generation are crucial for stabilization and personalization, with persona guiding end-of-session decisions and rationale aiding credit assignment. The work advances personalized behavioral simulation with practical implications for usability testing, targeted recommendations, and interface design in e-commerce.

Abstract

Simulating step-wise human behavior with Large Language Models (LLMs) has become an emerging research direction, enabling applications in various practical domains. While prior methods, including prompting, supervised fine-tuning (SFT), and reinforcement learning (RL), have shown promise in modeling step-wise behavior, they primarily learn a population-level policy without conditioning on a user's persona, yielding generic rather than personalized simulations. In this work, we pose a critical question: how can LLM agents better simulate personalized user behavior? We introduce Customer-R1, an RL-based method for personalized, step-wise user behavior simulation in online shopping environments. Our policy is conditioned on an explicit persona, and we optimize next-step rationale and action generation via action correctness reward signals. Experiments on the OPeRA dataset demonstrate that Customer-R1 not only significantly outperforms prompting and SFT-based baselines in next-action prediction tasks, but also better matches users' action distribution, indicating higher fidelity in personalized behavior simulation.

Paper Structure

This paper contains 20 sections, 6 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: User Behavior Simulation in Online Shopping. The model observes a sequence of historical user actions and learns to reason over this behavioral context to predict the user’s next action.
  • Figure 2: Customer-R1 Framework for Simulating User Behavior in Online Shopping. The model observes user history behaviors in a session composed of HTML observations $o_1, \dots,o_{t-1}$, actions $a_1, \dots,a_{t-1}$, rationales $r_1, \dots,r_{t-1}$, along with real user persona $P$ (demographics, personality, and shopping preferences). At time step $t$, given the current HTML observation $o_t$, the model predicts the rationale $r'_t$ for conducting an action and the corresponding next action $a'_t$. During training, the model samples $n$ rollouts per step. For each sampled prediction, a reward is calculated by comparing the predicted action $a'_t$ with the ground-truth action $a_t$ based on action correctness and format validity. These rewards are aggregated and used for policy optimization.
  • Figure 3: Fine-grained action distribution. a) Model trained using RL only. b) Model trained using SFT+RL. c) Model trained using SFT+RL without persona.
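
The reward and aggregation step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `step_reward` and `group_relative_advantages`, the `format_bonus` weight, the zero reward for malformed outputs, the string action labels, and the use of the population standard deviation are all hypothetical choices. The summary states only that the reward combines action correctness with format validity and that optimization uses GRPO, which normalizes rewards within each group of sampled rollouts.

```python
from statistics import mean, pstdev


def step_reward(predicted_action, true_action, format_ok, format_bonus=0.1):
    """Hypothetical verifiable reward for one sampled rollout.

    Assumption: a malformed output earns nothing; a well-formed output
    earns a small format bonus plus 1.0 if the action matches the ground truth.
    """
    if not format_ok:
        return 0.0
    correctness = 1.0 if predicted_action == true_action else 0.0
    return correctness + format_bonus


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of n rollouts for the same step.

    Each advantage is (reward - group mean) / (group std + eps), which is
    the group-relative baseline idea behind GRPO. Using the population
    standard deviation here is an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative usage: four rollouts sampled for a step whose
# ground-truth action is "click" (action labels are placeholders).
sampled_actions = ["click", "input", "click", "terminate"]
rewards = [step_reward(a, "click", format_ok=True) for a in sampled_actions]
advantages = group_relative_advantages(rewards)
```

In this sketch, rollouts that match the ground-truth action receive positive advantages and the others negative ones, so the policy update favors the matching predictions relative to the rest of the group. When every rollout in a group earns the same reward, all advantages are zero and that step contributes no learning signal.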