SAPIENT: Mastering Multi-turn Conversational Recommendation with Strategic Planning and Monte Carlo Tree Search

Hanwen Du, Bo Peng, Xia Ning

TL;DR

This paper tackles the challenge of non-myopic planning in multi-turn conversational recommender systems (MCR) by introducing SAPIENT, a framework that couples an RL-based S-agent with an MCTS-based S-planner to plan conversations strategically. S-planner simulates future conversations to maximize cumulative rewards, and its best plans mentor the S-agent through a self-training loop, enabling the agent to acquire planning expertise for inference without ongoing planner calls. An efficient variant, SAPIENT-e, trains on all S-planner trajectories via a listwise ranking objective, achieving similar performance with lower trajectory-collection cost. Extensive experiments on four benchmark datasets show that both SAPIENT and SAPIENT-e outperform 9 strong baselines in SR, AT, and hDCG, validating the effectiveness of targeted, non-myopic conversational strategies for information gathering and item recommendation.

Abstract

Conversational Recommender Systems (CRS) proactively engage users in interactive dialogues to elicit user preferences and provide personalized recommendations. Existing methods train Reinforcement Learning (RL)-based agents with greedy action selection or sampling strategies, and may suffer from suboptimal conversational planning. To address this, we present SAPIENT, a novel Monte Carlo Tree Search (MCTS)-based CRS framework. SAPIENT consists of a conversational agent (S-agent) and a conversational planner (S-planner). S-planner builds a conversational search tree with MCTS, based on the initial actions proposed by S-agent, to find conversation plans. The best conversation plans from S-planner are used to guide the training of S-agent, creating a self-training loop in which S-agent iteratively improves its capability for conversational planning. Furthermore, we propose an efficient variant, SAPIENT-e, that trades off training efficiency against performance. Extensive experiments on four benchmark datasets validate the effectiveness of our approach, showing that SAPIENT outperforms the state-of-the-art baselines. Our code and data are accessible through https://github.com/ninglab/SAPIENT.
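
The planning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy user simulator, the reward values, the attribute names, the turn limit, the uniformly random rollout policy, and the plain UCT selection rule are all assumptions made for illustration (in SAPIENT, S-agent supplies the heuristics for the search). The sketch runs MCTS over `ask` and `rec` actions, collects one trajectory per simulation, and returns the trajectory with the highest cumulative reward as the conversation plan.

```python
import math
import random
from dataclasses import dataclass, field

# Hypothetical toy user: the target item is described by these attribute values.
TARGET_ATTRS = frozenset({"jazz", "vinyl"})
ALL_ATTRS = ("jazz", "rock", "vinyl", "cd")
MAX_TURNS = 4

def legal_actions(state):
    """state = (attribute values asked so far, turn index)."""
    asked, _turn = state
    return [("ask", a) for a in ALL_ATTRS if a not in asked] + [("rec", None)]

def step(state, action):
    """Toy simulator: return (next_state, reward, done). Rewards are illustrative."""
    asked, turn = state
    kind, attr = action
    out_of_turns = turn + 1 >= MAX_TURNS
    if kind == "rec":
        hit = TARGET_ATTRS <= asked  # succeeds once every target attribute was asked
        return (asked, turn + 1), (1.0 if hit else -0.1), hit or out_of_turns
    reward = 0.01 if attr in TARGET_ATTRS else -0.1
    return (asked | {attr}, turn + 1), reward, out_of_turns

@dataclass
class Node:
    state: tuple
    done: bool = False
    visits: int = 0
    children: dict = field(default_factory=dict)  # action -> Node
    reward: dict = field(default_factory=dict)    # action -> immediate reward
    n: dict = field(default_factory=dict)         # action -> visit count
    q: dict = field(default_factory=dict)         # action -> mean return

def rollout(state, done, rng):
    """Finish the conversation with uniformly random actions."""
    total, actions = 0.0, []
    while not done:
        action = rng.choice(legal_actions(state))
        state, reward, done = step(state, action)
        total += reward
        actions.append(action)
    return total, actions

def simulate(node, w, rng):
    """One MCTS pass from `node`; returns (cumulative reward, actions taken)."""
    if node.done:
        return 0.0, []
    untried = [a for a in legal_actions(node.state) if a not in node.children]
    if untried:  # expansion, followed by a random rollout
        action = rng.choice(untried)
        nxt, reward, done = step(node.state, action)
        node.children[action] = Node(nxt, done)
        node.reward[action] = reward
        tail_return, tail = rollout(nxt, done, rng)
    else:  # UCT selection; w is the exploration factor
        action = max(
            node.children,
            key=lambda a: node.q[a] + w * math.sqrt(math.log(node.visits) / node.n[a]),
        )
        tail_return, tail = simulate(node.children[action], w, rng)
    ret = node.reward[action] + tail_return
    # Backpropagation: update visit counts and the running mean return.
    node.visits += 1
    node.n[action] = node.n.get(action, 0) + 1
    node.q[action] = node.q.get(action, 0.0) + (ret - node.q.get(action, 0.0)) / node.n[action]
    return ret, [action] + tail

def plan(num_simulations=500, w=1.0, seed=0):
    """Run MCTS and return (best return, best plan, all trajectories)."""
    rng = random.Random(seed)
    root = Node((frozenset(), 0))
    trajectories = [simulate(root, w, rng) for _ in range(num_simulations)]
    best_return, best_plan = max(trajectories, key=lambda t: t[0])
    return best_return, best_plan, trajectories
```

Calling `plan()` returns the highest-reward trajectory together with the full list of simulated trajectories. In the terms of the TL;DR, the single best trajectory corresponds to the plan that would guide S-agent, while the full list corresponds to what a listwise ranking objective such as the one in SAPIENT-e could consume.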

Paper Structure

This paper contains 33 sections, 16 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: An example of a conversational search tree for a user. The conversation starts at the root node, with the user specifying a preference on an attribute type and its value. The search tree expands as S-agent decides between different action types ($\mathtt{ask}$ and $\mathtt{rec}$) at each turn. The red line connects the highest-rewarded conversation plan found by the tree.
  • Figure 2: SAPIENT consists of a conversational agent (S-agent) and a conversational planner (S-planner). S-planner leverages MCTS to perform non-myopic conversational planning based on the heuristics from S-agent. The best conversation plans found by S-planner are used to guide the training of S-agent, enabling S-agent to engage in a self-training loop that iteratively improves its capability for conversational planning.
  • Figure A1: An illustration of the state $s_t=(\mathcal{P}^{+}_{t},\mathcal{P}^{-}_{t},\mathcal{V}^{-}_{t})$, which includes all the attribute values $\mathcal{P}^{+}_{t}$ that the user has accepted, all the attribute values $\mathcal{P}^{-}_{t}$ that the user has rejected, and all the items $\mathcal{V}^{-}_{t}$ that the user has rejected up to the $t$-th turn. Note that in this example, "looking for a medium price range" at the start of the conversation implies that all the other price ranges (low, high and premium) are not acceptable.
  • Figure A2: Success rate under different values of the exploration factor $w$.
  • Figure A3: Success rate and training time (per 100 gradient descent steps) under different rollout numbers $N$. The dotted lines represent the success rate, and the bars represent the training time.
  • ...and 2 more figures
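
The state described in the Figure A1 caption can be sketched as a small immutable record. This is an illustrative sketch only: the class name, field names, the `ATTRIBUTE_VALUES` schema, and the `exclusive` flag are hypothetical and are not taken from the paper's code. It shows how accepting one value of an attribute type such as price range could move the remaining values of that type into the rejected set, as in the caption's example.

```python
from dataclasses import dataclass

# Hypothetical attribute schema: attribute type -> all possible values.
ATTRIBUTE_VALUES = {"price": {"low", "medium", "high", "premium"}}

@dataclass(frozen=True)
class State:
    accepted_values: frozenset = frozenset()  # P+_t: attribute values the user accepted
    rejected_values: frozenset = frozenset()  # P-_t: attribute values the user rejected
    rejected_items: frozenset = frozenset()   # V-_t: items the user rejected

def accept_value(state, attr_type, value, exclusive=True):
    """Record an accepted value; if the type is exclusive, reject its sibling values."""
    siblings = ATTRIBUTE_VALUES[attr_type] - {value} if exclusive else set()
    return State(
        state.accepted_values | {(attr_type, value)},
        state.rejected_values | {(attr_type, v) for v in siblings},
        state.rejected_items,
    )

def reject_items(state, items):
    """Record items that the user turned down after a recommendation."""
    return State(
        state.accepted_values,
        state.rejected_values,
        state.rejected_items | frozenset(items),
    )
```

For example, `accept_value(State(), "price", "medium")` yields a state whose accepted set holds the medium price range and whose rejected set holds the low, high and premium ranges.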