Future Impact Decomposition in Request-level Recommendations

Xiaobei Wang; Shuchang Liu; Xueliang Wang; Qingpeng Cai; Lantao Hu; Han Li; Peng Jiang; Kun Gai; Guangming Xie

TL;DR

The paper addresses the mismatch between item-level user behavior and list-wise actions in RL-based recommender systems by introducing ItemA2C, an item-wise decomposition of the A2C framework under a request-level MDP with item-wise rewards. It adds two future-impact reweighting mechanisms: a simple weighting parameterized by $\alpha$, and a learnable adversarial weight model that produces per-item weights $w_{t,k}$ while preserving the overall list target via normalization, ensuring $\sum_{i\in a_t} \Psi_w(s_t,i)=\Psi(s_t,a_t)$. Extensive offline evaluation on ML1M and KuaiRand demonstrates that item-wise learning, particularly the model-based reweighting variant ItemA2C-M, yields higher long-term reward and deeper sessions than baselines such as HAC and SlateQ; online A/B tests on a large-scale video platform corroborate these gains across engagement metrics. The findings support item-level future impact attribution as a practical, deployable enhancement for list-wise recommendation, with potential extensions to other RL paradigms and reward designs.
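To make the normalization constraint concrete, the following is a minimal sketch in Python, not the paper's implementation. It assumes the item-level target is the list-level target scaled by a weight that sums to one over the list, which is one simple way to satisfy $\sum_{i\in a_t} \Psi_w(s_t,i)=\Psi(s_t,a_t)$. The function names and the blending rule in `reward_mixed_weights` are hypothetical illustrations; the exact form of the $\alpha$-based weighting is not given in this summary.

```python
from typing import List, Sequence


def normalize_weights(raw: Sequence[float]) -> List[float]:
    """Scale non-negative raw weights so they sum to 1 (uniform if all are zero)."""
    k = len(raw)
    if k == 0:
        raise ValueError("the recommended list must contain at least one item")
    if any(w < 0 for w in raw):
        raise ValueError("raw weights must be non-negative")
    total = sum(raw)
    if total == 0:
        return [1.0 / k] * k
    return [w / total for w in raw]


def decompose_list_target(psi_list: float, raw_weights: Sequence[float]) -> List[float]:
    """Split a list-level target Psi(s_t, a_t) into item-level targets.

    Assumption: each item receives w_k * Psi with sum_k w_k = 1, so the
    item-level targets add back up to the list-level target.
    """
    weights = normalize_weights(raw_weights)
    return [w * psi_list for w in weights]


def reward_mixed_weights(item_rewards: Sequence[float], alpha: float) -> List[float]:
    """Hypothetical alpha-weighting: blend uniform and reward-proportional shares.

    alpha = 0 gives uniform weights; alpha = 1 gives weights proportional
    to the observed item-wise rewards. This rule is an illustration only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    k = len(item_rewards)
    reward_share = normalize_weights(item_rewards)
    return [(1.0 - alpha) / k + alpha * share for share in reward_share]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0]          # observed item-wise rewards in one request
    weights = reward_mixed_weights(rewards, alpha=0.5)
    item_targets = decompose_list_target(2.0, weights)
    print(weights, item_targets, sum(item_targets))
```

Because the weights are normalized before scaling, the decomposition preserves the list-level target whatever produces the raw weights, whether a heuristic such as the one above or a learned weight model.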

Abstract

In recommender systems, reinforcement learning solutions have shown promising results in optimizing the long-term performance of the interaction sequence between users and the system. For practical reasons, the policy's actions are typically designed as recommending a list of items to handle users' frequent and continuous browsing requests more efficiently. In this list-wise recommendation scenario, the user state is updated upon every request in the corresponding MDP formulation. However, this request-level formulation is essentially inconsistent with the user's item-level behavior. In this study, we demonstrate that an item-level optimization approach can better utilize item characteristics and optimize the policy's performance even under the request-level MDP. We support this claim by comparing the performance of standard request-level methods with the proposed item-level actor-critic framework in both simulation and online experiments. Furthermore, we show that a reward-based future decomposition strategy can better express the item-wise future impact and improve the recommendation accuracy in the long term. To achieve a more thorough understanding of the decomposition strategy, we propose a model-based re-weighting framework with adversarial learning that further boosts the performance, and we investigate its correlation with the reward-based strategy.

Paper Structure (29 sections, 12 equations, 8 figures, 4 tables, 1 algorithm)

Introduction
Related Work
List-wise Recommendation
Reinforcement Learning for Recommendation
Method
Problem Formulation
Request-level A2C
Item-wise Decomposition of A2C
Weighting Item-wise Future Impact
Model-based Future Impact Re-weighting
Overall Item-level Learning Framework
Experiments
Offline Experiments with Simulator
Datasets and Online Simulator
Evaluation Protocol
...and 14 more sections

Figures (8)

Figure 1: Request-level MDP with observable item-level reward. The special case K=1 is equivalent to an item-level MDP.
Figure 2: The ItemA2C learning framework, where $\odot$ represents the score function items and $\oplus$ represents summation.
Figure 3: Learning curves of all methods.
Figure 4: Training curves of all methods on KuaiRand. We also show the critic and actor loss curves.
Figure 5: Trends of the cosine similarity and Pearson correlation coefficient between the weight model and the heuristic re-weighting during training.
...and 3 more figures
