An Efficient Continuous Control Perspective for Reinforcement-Learning-based Sequential Recommendation

Jun Wang, Likang Wu, Qi Liu, Yu Yang

TL;DR

The paper formulates offline reinforcement learning for sequential recommendation as a continuous-control problem. It introduces ECoC, an Efficient Continuous Control framework that abstracts actions from normalized user and item spaces into unit vectors and stabilizes offline training through strategic exploration and dual conservatism regularization. Key contributions are the unified action representation, a tailored actor-critic objective with the terms L_REG and L_BC, and a constrained directional policy gradient for offline optimization. Experiments on three real-world datasets report better imitation and off-policy performance than discrete RL baselines at lower training cost.

Abstract

Sequential recommendation, where user preference is dynamically inferred from sequential historical behaviors, is a critical task in recommender systems (RSs). To further optimize long-term user engagement, offline reinforcement-learning-based RSs have become a mainstream technique, as they avoid the global exploration that may harm online users' experiences. However, previous studies mainly focus on discrete action and policy spaces, which may have difficulty handling a rapidly growing number of items efficiently. To mitigate this issue, we aim to design an algorithmic framework applicable to continuous policies. To facilitate control in the low-dimensional but dense user preference space, we propose an Efficient Continuous Control framework (ECoC). Based on a statistically tested assumption, we first propose a novel unified action representation abstracted from normalized user and item spaces. Then, we develop the corresponding policy evaluation and policy improvement procedures. During this process, strategic exploration and directional control in terms of unified actions are carefully designed and are crucial to the final recommendation decisions. Moreover, benefiting from unified actions, the conservatism regularizers for policies and value functions are combined and are fully compatible with the continuous framework. The resulting dual regularization ensures the successful offline training of RL-based recommendation policies. Finally, we conduct extensive experiments to validate the effectiveness of our framework. The results show that, compared to the discrete baselines, ECoC is trained far more efficiently, while the final policies outperform the baselines in both capturing the offline data and gaining long-term rewards.
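
The idea of acting in a normalized embedding space can be illustrated with a minimal sketch. It assumes, for illustration only, that a continuous action is a vector projected onto the unit sphere and that items are retrieved by cosine similarity against L2-normalized item embeddings. The names `unit` and `recommend_top_k`, the embedding dimension, and the random data are hypothetical and are not taken from the paper.

```python
import numpy as np


def unit(x, axis=-1, eps=1e-12):
    """L2-normalize vectors so they lie on the unit sphere."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def recommend_top_k(action, item_embeddings, k):
    """Rank items by cosine similarity to a continuous action vector.

    Assumption: the policy outputs a direction in the item embedding
    space, and the recommendation is the k items closest to it.
    """
    a = unit(action)                  # shape (d,)
    e = unit(item_embeddings)         # shape (num_items, d)
    scores = e @ a                    # cosine similarities, shape (num_items,)
    top = np.argsort(-scores)[:k]     # indices of the k highest scores
    return top, scores[top]


# Hypothetical data: 1000 items with 16-dimensional embeddings.
rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(1000, 16))

# A hypothetical policy output that points close to item 42.
action = item_embeddings[42] + 0.1 * rng.normal(size=16)

top_items, top_scores = recommend_top_k(action, item_embeddings, k=5)
print(top_items, top_scores)
```

Because the scoring step is a single matrix-vector product, the policy output stays at the embedding dimension $d$ rather than growing with the number of items, which is the kind of efficiency argument the abstract makes for continuous policies.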

Paper Structure

This paper contains 41 sections, 1 theorem, 30 equations, 10 figures, 6 tables.

Key Result

Corollary 5.1

Under an empirical MDP $\widehat{\mathcal{M}}$ derived from logged data $\mathcal{H}$, with deterministic policy $\pi_{\theta}$ (parameterized by $\theta$) from the constrained policy space $\left\{ \pi \mid \left(s, \pi(a) \right) \in \mathcal{H} \right\}$ and the accompanying value function $Q^{\p

Figures (10)

  • Figure 1: Two implementation manners for sequential recommendation. The discrete version is presented in the dashed box on the left-hand side whereas the continuous version is illustrated in the dashed box on the right-hand side.
  • Figure 2: Cosine values of angles between $\mathrm{e}^i$ and $\mathrm{p}^{\mathrm{u}}$ when the ground-truth item $i$ ranks in top-K.
  • Figure 3: The existing discrete version of actor-critic recommendation framework (Left) and our ECoC framework (Right), which is designed for continuous action spaces and policies. The key difference lies in the utilization of the item embedding matrix, which leads to the unified evaluation and directional control.
  • Figure 4: An illustration of unified evaluation (left) and conservatism regularization (right) when $d=2$.
  • Figure 5: The workflow of off-policy evaluation.
  • ...and 5 more figures

Theorems & Definitions (1)

  • Corollary 5.1