ChatShopBuddy: Towards Reliable Conversational Shopping Agents via Reinforcement Learning

Yiruo Cheng, Kelong Mao, Tianhao Li, Jiejun Tan, Ji-Rong Wen, Zhicheng Dou

TL;DR

Extensive experiments demonstrate that the RL-trained agent, ChatShopBuddy, consistently outperforms larger models relying on generic reasoning, achieving superior stability rather than merely higher peaks.

Abstract

Conversational shopping agents represent a critical consumer-facing application of Large Language Model (LLM)-powered agents, yet how to effectively apply post-training Reinforcement Learning (RL) to optimize such agents remains underexplored. This work investigates RL-based optimization for shopping agents in real-world scenarios, where agents must simultaneously satisfy multiple interdependent objectives spanning objective metrics (product correctness), subjective qualities (persuasiveness), outcome rewards (final response quality), and process rewards (tool efficiency). We present a complete methodology to address this challenge. Specifically, we first construct SmartShopBench, a benchmark that captures diverse shopping intents with a hierarchical evaluation that decomposes complex quality requirements into measurable levels. Building on this evaluation framework, we design Hierarchical Reward Modeling (HRM) to structure mixed reward types through conditional gating that reflects their logical dependencies. To enable efficient training, we further propose Dynamic Contrastive Policy Optimization (DCPO), which balances response quality with operational efficiency through dynamic trajectory selection based on reward and reasoning length. Extensive experiments demonstrate that our RL-trained agent, namely ChatShopBuddy, consistently outperforms larger models relying on generic reasoning, achieving superior stability rather than merely higher peaks. Our work provides valuable guidance for applying RL to real-world conversational agents.
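
The conditional gating behind HRM can be illustrated with a minimal sketch. The abstract and the Figure 2 caption state only that an L1 grader gate, an L2 grader gate, and a process reward exist. The gating order, the reward magnitudes, the `GraderResult` fields, and the tool-efficiency formula below are all assumptions made for illustration, not the paper's definitions.

```python
from dataclasses import dataclass


@dataclass
class GraderResult:
    """Hypothetical per-trajectory grading output (field names are assumptions)."""
    l1_pass: bool     # objective check, e.g. product correctness
    l2_pass: bool     # subjective check, e.g. persuasiveness above a threshold
    l2_score: float   # subjective quality score in [0, 1]
    tool_calls: int   # number of tool calls used in the trajectory


def hierarchical_reward(result: GraderResult,
                        l1_reward: float = 1.0,
                        max_tool_calls: int = 8,
                        process_weight: float = 0.2) -> float:
    """Gate later reward terms on earlier ones (assumed order: L1, L2, process)."""
    # Gate 1: without a correct product, nothing else is rewarded.
    if not result.l1_pass:
        return 0.0
    reward = l1_reward

    # Gate 2: the subjective score only counts once the L2 grader passes.
    if not result.l2_pass:
        return reward
    reward += result.l2_score

    # Process reward: fewer tool calls give a larger efficiency bonus.
    efficiency = max(0.0, 1.0 - result.tool_calls / max_tool_calls)
    return reward + process_weight * efficiency
```

Under these assumed values, a trajectory that fails the L1 gate receives 0.0 regardless of how persuasive or efficient it is, which is the kind of logical dependency the gating is meant to encode.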

Paper Structure

This paper contains 40 sections, 6 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Example of a conversational shopping agent.
  • Figure 2: Overview of our conversational shopping agent. Panel (a) shows the inference process of the shopping agent. Panel (b) shows Hierarchical Reward Modeling: the green gate indicates whether the L1 Grader passes, the purple gate indicates whether the L2 Grader passes, and the red gate denotes the computation of the process reward. Panel (c) shows the end-to-end RL training process with DCPO, which uses a dynamic contrastive selection strategy to choose K/2 trajectories from a total of K.
  • Figure 3: Example trajectory of the shopping agent.
  • Figure 4: Performance of ChatShopBuddy across six shopping categories. We report Avg@4 over four independent runs.
  • Figure 5: Comparison of reasoning length during training, DCPO vs. GRPO.
  • ...and 4 more figures
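
The Figure 2 caption mentions that DCPO chooses K/2 trajectories from a total of K, and the abstract says the selection depends on reward and reasoning length. The exact rule is not given here, so the sketch below is only one plausible reading: rank trajectories by reward (higher first) and break ties by reasoning length (shorter first), then keep the best and worst quarters to form a contrastive set. The `Trajectory` type and the `select_contrastive` function are hypothetical names.

```python
from typing import List, NamedTuple


class Trajectory(NamedTuple):
    """Hypothetical rollout summary (field names are assumptions)."""
    name: str
    reward: float
    reasoning_len: int  # e.g. number of reasoning tokens


def select_contrastive(trajs: List[Trajectory]) -> List[Trajectory]:
    """Keep K/2 of K trajectories: the top-ranked and the bottom-ranked ones.

    Ranking (an assumption): higher reward first, shorter reasoning on ties.
    """
    k = len(trajs)
    n_keep = k // 2
    n_top = (n_keep + 1) // 2   # best trajectories to keep
    n_bottom = n_keep - n_top   # worst trajectories to keep

    ranked = sorted(trajs, key=lambda t: (-t.reward, t.reasoning_len))
    # Slice from an explicit index so n_bottom == 0 yields an empty tail.
    return ranked[:n_top] + ranked[k - n_bottom:]


rollouts = [
    Trajectory("a", 1.0, 120), Trajectory("b", 1.0, 80),
    Trajectory("c", 0.5, 60),  Trajectory("d", 0.0, 200),
    Trajectory("e", 0.0, 50),  Trajectory("f", 0.7, 90),
    Trajectory("g", 0.2, 30),  Trajectory("h", 0.9, 100),
]
selected = select_contrastive(rollouts)
```

With K = 8 this keeps four trajectories: the two highest-ranked (short, high-reward) and the two lowest-ranked, so the policy update sees a clear contrast while processing half the rollouts.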