Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping
Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, Dakuo Wang
TL;DR
The paper tackles the challenge of simulating personalized, step-wise user behavior in online shopping by introducing Customer-R1, an RL-based LLM agent conditioned on explicit user personas. It defines a next-action generation task in which the model produces a rationale and an action, and it uses a verifiable reward that combines action correctness and output formatting, optimized via Group Relative Policy Optimization (GRPO). Empirical results on the OPeRA dataset show that RL initialized from an SFT checkpoint yields the best performance, significantly surpassing prompting and SFT-only baselines and producing action distributions that better reflect user-specific personas. Ablation studies indicate that both persona information and rationale generation are crucial for training stability and personalization, with the persona guiding end-of-session decisions and the rationale aiding credit assignment. The work advances personalized behavioral simulation, with practical implications for usability testing, targeted recommendations, and interface design in e-commerce.
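As a rough illustration of what a verifiable reward combining formatting and action correctness could look like, here is a minimal Python sketch. The tag-based output format, the JSON action schema, the function names, and the weights (`format_weight`, `action_weight`) are all assumptions made for this example; they are not taken from the paper.

```python
import json
import re

# Hypothetical output format (an assumption, not the paper's):
# a free-text rationale followed by a JSON-encoded action.
OUTPUT_PATTERN = re.compile(
    r"^\s*<rationale>(?P<rationale>.+?)</rationale>\s*"
    r"<action>(?P<action>.+?)</action>\s*$",
    re.DOTALL,
)


def parse_output(text):
    """Return the predicted action dict, or None if the output is malformed."""
    match = OUTPUT_PATTERN.match(text)
    if match is None:
        return None
    try:
        action = json.loads(match.group("action"))
    except json.JSONDecodeError:
        return None
    # Require a dict with a "type" field (hypothetical schema).
    if not isinstance(action, dict) or "type" not in action:
        return None
    return action


def reward(text, gold_action, format_weight=0.2, action_weight=1.0):
    """Format reward plus exact-match action reward (weights are illustrative)."""
    action = parse_output(text)
    if action is None:
        return 0.0  # malformed output earns nothing
    score = format_weight  # well-formed output earns the format reward
    if action == gold_action:
        score += action_weight  # correct action earns the correctness reward
    return score
```

Because both components are checked programmatically against the logged user action, such a reward needs no learned reward model. Exact matching on the whole action is the simplest choice here; a finer-grained scheme could score the action type and its arguments separately.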
Abstract
Simulating step-wise human behavior with Large Language Models (LLMs) is an emerging research direction, enabling applications in various practical domains. While prior methods, including prompting, supervised fine-tuning (SFT), and reinforcement learning (RL), have shown promise in modeling step-wise behavior, they primarily learn a population-level policy without conditioning on a user's persona, yielding generic rather than personalized simulations. In this work, we pose a critical question: how can LLM agents better simulate personalized user behavior? We introduce Customer-R1, an RL-based method for personalized, step-wise user behavior simulation in online shopping environments. Our policy is conditioned on an explicit persona, and we optimize next-step rationale and action generation via action correctness reward signals. Experiments on the OPeRA dataset demonstrate that Customer-R1 not only significantly outperforms prompting and SFT-based baselines in next-action prediction tasks, but also better matches users' action distribution, indicating higher fidelity in personalized behavior simulation.
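To make the RL signal more concrete, the following minimal Python sketch shows the group-relative advantage computation commonly associated with GRPO: several outputs are sampled for the same context, each is scored, and each reward is normalized against its own group. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration, not details confirmed by the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled output's reward against its own group.

    `rewards` holds the scalar rewards of several outputs sampled for the
    same (persona, session context) prompt. `eps` avoids division by zero
    when every sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled outputs for one prompt, two of which are correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Outputs that score above the group mean receive positive advantages and are reinforced, while those below the mean are discouraged. When all samples in a group receive the same reward, the advantages are zero and that group contributes no gradient signal.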
