Value Function Decomposition in Markov Recommendation Process

Xiaobei Wang, Shuchang Liu, Qingpeng Cai, Xiang Li, Lantao Hu, Han Li, Guangming Xie

TL;DR

This paper addresses long-horizon optimization in RL-based recommenders by identifying a key source of TD estimation noise: the mixing of policy-driven action exploration and stochastic user responses. It introduces a TD Decomposition framework that separately learns a state value $V(s)$ and a state-action value $Q(s,a)$ via two decoupled losses, $\mathcal{L}_{\text{stateTD}}$ and $\mathcal{L}_{\text{actionTD}}$, with a debiasing term $\beta$ to align with the current policy. Empirical results in simulated online environments built on public datasets show that decomposition improves recommendation performance, accelerates learning, and increases robustness to exploration across multiple TD-based backbones (e.g., A2C, DQN, HAC). The method demonstrates stronger gains in harder domains and provides a general technique that can be applied to various TD-based RL frameworks, with potential extensions to value–actor interactions. The work contributes a principled approach to disentangling randomness in online RL for recommendations, enabling more reliable long-term optimization in complex user environments.

Abstract

Recent advances in recommender systems have shown that user-system interaction essentially formulates long-term optimization problems, and online reinforcement learning can be adopted to improve recommendation performance. The general solution framework incorporates a value function that estimates the user's expected cumulative rewards in the future and guides the training of the recommendation policy. To avoid local maxima, the policy may explore potential high-quality actions during inference to increase the chance of finding better future rewards. To accommodate the stepwise recommendation process, one widely adopted approach to learning the value function is learning from the difference between the values of two consecutive states of a user. However, we argue that this paradigm involves a challenge of Mixing Random Factors: there exist two random factors from the stochastic policy and the uncertain user environment, but they are not separately modeled in the standard temporal difference (TD) learning, which may result in a suboptimal estimation of the long-term rewards and less effective action exploration. As a solution, we show that these two factors can be separately approximated by decomposing the original temporal difference loss. The disentangled learning framework can achieve a more accurate estimation with faster learning and improved robustness against action exploration. As an empirical verification of our proposed method, we conduct offline experiments with simulated online environments built on the basis of public datasets.
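The decomposition idea in the abstract can be illustrated with a minimal sketch. It rests on two standard Bellman identities: $Q(s,a) = \mathbb{E}_{s'}[r + \gamma V(s')]$, where the expectation is over the user environment, and $V(s) = \mathbb{E}_{a \sim \pi}[Q(s,a)]$, where the expectation is over the policy. The function name `decomposed_td_losses`, the loss names `env_loss` and `policy_loss`, and the per-sample `weight` argument are all assumptions made for illustration. In particular, `weight` is only a hypothetical stand-in for a debiasing term, and the sketch does not claim to reproduce the paper's exact $\mathcal{L}_{\text{stateTD}}$, $\mathcal{L}_{\text{actionTD}}$, or $\beta$.

```python
import numpy as np

def decomposed_td_losses(q_sa, v_s, v_next, reward, done, gamma=0.9, weight=None):
    """Illustrative two-part TD objective (a sketch, not the paper's exact losses).

    q_sa   : Q(s_t, a_t) estimates for the sampled actions
    v_s    : V(s_t) estimates
    v_next : V(s_{t+1}) estimates
    reward : immediate rewards r_t
    done   : 1.0 where the episode ended at s_{t+1}, else 0.0
    weight : hypothetical per-sample weight standing in for a debiasing term
    """
    q_sa, v_s, v_next, reward, done = (
        np.asarray(x, dtype=float) for x in (q_sa, v_s, v_next, reward, done)
    )
    if weight is None:
        weight = np.ones_like(q_sa)
    weight = np.asarray(weight, dtype=float)

    # Environment randomness: Q(s, a) regresses onto r + gamma * V(s').
    # In a gradient-based learner this target would be held fixed (stop-gradient).
    env_target = reward + gamma * (1.0 - done) * v_next
    env_loss = np.mean((env_target - q_sa) ** 2)

    # Policy randomness: V(s) regresses onto Q(s, a) for the sampled action.
    policy_loss = np.mean(weight * (q_sa - v_s) ** 2)

    return env_loss, policy_loss


# Toy batch of two transitions.
env_loss, policy_loss = decomposed_td_losses(
    q_sa=[1.0, 2.0], v_s=[1.0, 1.0], v_next=[0.0, 2.0],
    reward=[1.0, 0.0], done=[0.0, 0.0], gamma=0.5,
)
```

By contrast, a standard single-loss TD update regresses $V(s_t)$ directly onto $r + \gamma V(s_{t+1})$, so the sampled action and the sampled next state both contribute noise to the same target. Splitting the regression gives each source of randomness its own target.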

Paper Structure

This paper contains 27 sections, 19 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: The proposed decomposition method applied to an RL backbone (i.e., HAC) on the KuaiRand dataset improves overall performance and is more robust to exploration of recommendation actions.
  • Figure 2: The general Markov Recommendation Process. Standard TD approaches (left) adopt either $Q$-based or $V$-based TD. Our solution (right) decomposes the learning into two objectives, one for the random policy and one for the stochastic user environment.
  • Figure 3: Performance of the original and decomposed TD methods on A2C on the KuaiRand dataset.
  • Figure 4: Performance of the original and decomposed TD methods on HAC on the KuaiRand dataset.
  • Figure 5: The effect of action exploration in HAC. X-axis represents the magnitude of action exploration $\sigma$. The shaded area represents the standard deviation of 5-round experiments with random seeds.
  • ...and 4 more figures