Maximum-Entropy Regularized Decision Transformer with Reward Relabelling for Dynamic Recommendation

Xiaocong Chen, Siyu Wang, Lina Yao

TL;DR

This work tackles two challenges in DT-based offline RL for dynamic recommender systems: stitching sub-optimal trajectories and insufficient online exploration. It introduces EDT4Rec, which combines max-entropy exploration with a novel reward relabeling strategy guided by conservative Q-learning to leverage sub-trajectory segments and improve online adaptation. Empirical results across six real-world offline datasets and online simulations show that EDT4Rec outperforms existing DT-based offline RL methods and traditional RL baselines, with ablations confirming the critical roles of exploration and relabeling. The approach advances practical offline-to-online reinforcement learning for dynamic recommendation, offering improved performance in data-sparse, evolving user environments.

Abstract

Reinforcement learning-based recommender systems have recently gained popularity. However, due to the typical limitations of simulation environments (e.g., data inefficiency), most of the work cannot be broadly applied in all domains. To counter these challenges, recent advancements have leveraged offline reinforcement learning methods, notable for their data-driven approach utilizing offline datasets. A prominent example of this is the Decision Transformer. Despite its popularity, the Decision Transformer approach has inherent drawbacks, particularly evident in recommendation methods based on it. This paper identifies two key shortcomings in existing Decision Transformer-based methods: a lack of stitching capability and limited effectiveness in online adoption. In response, we introduce a novel methodology named Max-Entropy enhanced Decision Transformer with Reward Relabeling for Offline RLRS (EDT4Rec). Our approach begins with a max entropy perspective, leading to the development of a max entropy enhanced exploration strategy. This strategy is designed to facilitate more effective exploration in online environments. Additionally, to augment the model's capability to stitch sub-optimal trajectories, we incorporate a unique reward relabeling technique. To validate the effectiveness and superiority of EDT4Rec, we have conducted comprehensive experiments across six real-world offline datasets and in an online simulator.

Paper Structure

This paper contains 15 sections, 1 theorem, 10 equations, 4 figures, 1 table, 2 algorithms.

Key Result

Theorem 1

The value of the policy under the Q-function learned with the conservative Q-learning (CQL) objective, $\hat{V}_\pi(\mathbf{s}) = \mathbb{E}_{\pi(\mathbf{a}|\mathbf{s})}[\hat{Q}_\pi(\mathbf{s},\mathbf{a})]$, lower-bounds the true value of the policy obtained via exact policy evaluation.
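
The lower bound can be checked numerically in a small tabular setting. The sketch below is not the paper's implementation. It assumes the referenced objective matches the standard CQL policy-evaluation objective of Kumar et al. (2020) with the penalty distribution set to the evaluated policy, and it ignores sampling error by using the exact Bellman operator. Under those assumptions the CQL fixed point equals ordinary policy evaluation on a penalized reward $r - \alpha\,(\pi - \pi_\beta)/\pi_\beta$. The MDP, the policies `pi` and `pi_beta`, and the helper `policy_values` are hypothetical and exist only for illustration.

```python
import numpy as np

def policy_values(P, R, pi, gamma, penalty=None):
    """Exact tabular policy evaluation; returns V(s) = sum_a pi(a|s) Q(s,a)."""
    S, A = R.shape
    r = R if penalty is None else R - penalty
    # State-action transition matrix under pi:
    # P_pi[(s,a),(s',a')] = P[s,a,s'] * pi[s',a']
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    # Solve (I - gamma * P_pi) q = r for the fixed point of the Bellman operator.
    q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A))
    return (pi * q.reshape(S, A)).sum(axis=1)

# Hypothetical toy MDP (assumption: not taken from the paper).
rng = np.random.default_rng(0)
S, A, gamma, alpha = 4, 3, 0.9, 0.5
P = rng.dirichlet(np.ones(S), size=(S, A))        # transitions, shape (S, A, S)
R = rng.uniform(0.0, 1.0, size=(S, A))            # rewards
pi = rng.dirichlet(np.ones(A), size=S)            # policy being evaluated
# Behavior policy, mixed with uniform so every action has non-trivial mass.
pi_beta = 0.5 * rng.dirichlet(np.ones(A), size=S) + 0.5 / A

# Conservative penalty at the CQL fixed point (penalty distribution = pi).
penalty = alpha * (pi - pi_beta) / pi_beta

V_true = policy_values(P, R, pi, gamma)           # exact policy evaluation
V_cql = policy_values(P, R, pi, gamma, penalty)   # conservative estimate

print("V_true:", np.round(V_true, 3))
print("V_cql: ", np.round(V_cql, 3))
print("lower bound holds in every state:", bool(np.all(V_cql <= V_true + 1e-9)))
```

The gap between the two value vectors grows with `alpha` and with the mismatch between `pi` and `pi_beta`, which is the sense in which the estimate is conservative for actions poorly covered by the offline data.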

Figures (4)

  • Figure 1: An example demonstrating that DT, when applied directly to RS, faces the stitching problem (i.e., it cannot learn from sub-optimal trajectories).
  • Figure 2: The overall structure of the proposed EDT4Rec. The backbone is the causal decision transformer.
  • Figure 3: (a) Overall comparison, with variance, between the baselines and EDT4Rec in the VirtualTaobao simulation environment. (b) Study of the hyperparameter $g_{online}$; the value reported is the average CTR over $100{,}000$ timesteps. (c) Study of the hyperparameter $K$; the value reported is the average CTR over $100{,}000$ timesteps.
  • Figure 4: Ablation Study

Theorems & Definitions (1)

  • Theorem 1: Lower Bound of CQL (Kumar et al., 2020)