Retentive Decision Transformer with Adaptive Masking for Reinforcement Learning based Recommendation Systems

Siyu Wang, Xiaocong Chen, Lina Yao

TL;DR

This paper tackles offline reinforcement learning for recommender systems, addressing reward design and data efficiency for long, variable-length interaction histories. It introduces MaskRDT, a RetNet-based framework that uses adaptive causal masking to expose the model to variable history lengths and a multi-scale segmented retention mechanism to model long sequences efficiently. By recasting sequential decision-making as an inference task and training on expert trajectories, MaskRDT achieves superior performance and training efficiency in both online simulations and offline datasets, with robust results across diverse domains and trajectory sizes. The approach is of practical value for scalable, data-efficient RL-based recommendation in real-world systems, balancing predictive accuracy, computational cost, and adaptability to changing user behavior.
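To make "adaptive causal masking to expose variable history lengths" concrete, here is a minimal Python sketch. It assumes that each timestep contributes three tokens (RTG, state, action), as in Decision Transformer, and that "adaptive" means drawing a visible history length per training example and hiding every older token. The names `adaptive_causal_mask`, `sample_mask` and `TOKENS_PER_STEP` and the uniform sampling scheme are hypothetical illustrations, not the paper's exact masking configurations.

```python
import random
from typing import List

# Assumption: each timestep contributes three tokens (RTG, state, action).
TOKENS_PER_STEP = 3


def adaptive_causal_mask(num_steps: int, visible_steps: int) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j.

    Tokens of the oldest (num_steps - visible_steps) timesteps are hidden;
    the remaining tokens attend causally (j <= i). Rows of hidden tokens are
    all False, so their outputs should be excluded from the loss.
    """
    if not 1 <= visible_steps <= num_steps:
        raise ValueError("visible_steps must be in [1, num_steps]")
    n = num_steps * TOKENS_PER_STEP
    first_visible = (num_steps - visible_steps) * TOKENS_PER_STEP
    return [[first_visible <= j <= i for j in range(n)] for i in range(n)]


def sample_mask(num_steps: int, rng: random.Random) -> List[List[bool]]:
    """Hypothetical scheme: draw the visible history length uniformly per example."""
    visible = rng.randint(1, num_steps)  # inclusive on both ends
    return adaptive_causal_mask(num_steps, visible)


# Usage: a 4-step trajectory where only the last 2 steps (tokens 6..11) are visible.
mask = adaptive_causal_mask(num_steps=4, visible_steps=2)
print(sum(row.count(True) for row in mask))  # 21 = 6 * 7 / 2 allowed (i, j) pairs
```

Sampling a fresh history length for every example means the same trajectory is seen under many context lengths during training, which is one plausible way to obtain the robustness to variable history lengths described above.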

Abstract

Reinforcement Learning-based Recommender Systems (RLRS) have shown promise across a spectrum of applications, from e-commerce platforms to streaming services. Yet they face notable challenges, particularly in designing reward functions and in exploiting large pre-existing datasets within the RL framework. Recent advances in offline RLRS offer a way to address both challenges. However, existing methods mainly rely on the transformer architecture, whose computational resource requirements and training costs grow as sequence lengths increase. Additionally, prevalent methods use fixed-length input trajectories, which limits their ability to capture evolving user preferences. In this study, we introduce a new offline RLRS method to address these problems. We reframe the RLRS problem by modeling sequential decision-making as an inference task using adaptive masking configurations. This adaptive approach selectively masks input tokens, turning the recommendation task into inference over varying subsets of tokens and thereby improving the agent's ability to make inferences across diverse trajectory lengths. Furthermore, we incorporate a multi-scale segmented retention mechanism that enables efficient modeling of long sequences and significantly improves computational efficiency. Our experiments, conducted on both an online simulator and offline datasets, demonstrate the advantages of the proposed method.
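The segmented retention idea (parallel computation inside a segment, recurrence between segments) can be illustrated with RetNet's chunkwise form of retention. The sketch below is a simplified plain-Python version that omits the rotation, normalization and gating RetNet uses; the function names, the segment length `seg_len` and the decay rates are illustrative assumptions, not the paper's implementation. "Multi-scale" is modeled, as in RetNet, by giving each head its own decay rate γ. The sketch checks the key property: the segmented computation gives the same output as the token-by-token recurrence.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def _dot(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def recurrent_retention(q: Matrix, k: Matrix, v: Matrix, gamma: float) -> Matrix:
    """Token-by-token reference: S_n = gamma * S_{n-1} + k_n^T v_n, o_n = q_n S_n."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for qn, kn, vn in zip(q, k, v):
        state = [[gamma * state[a][c] + kn[a] * vn[c] for c in range(d_v)] for a in range(d_k)]
        out.append([sum(qn[a] * state[a][c] for a in range(d_k)) for c in range(d_v)])
    return out


def segmented_retention(q: Matrix, k: Matrix, v: Matrix, gamma: float, seg_len: int) -> Matrix:
    """Parallel form inside each segment, recurrent state carried across segments."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # summarizes all previous segments
    out = []
    for start in range(0, len(q), seg_len):
        qs, ks, vs = q[start:start + seg_len], k[start:start + seg_len], v[start:start + seg_len]
        b = len(qs)  # the last segment may be shorter than seg_len
        for i in range(b):
            # Intra-segment term sum_{j<=i} gamma^(i-j) (q_i . k_j) v_j. Written as loops
            # here; in matrix form it is (Q K^T * D) V with decay mask D[i][j] = gamma^(i-j).
            row = [0.0] * d_v
            for j in range(i + 1):
                w = gamma ** (i - j) * _dot(qs[i], ks[j])
                for c in range(d_v):
                    row[c] += w * vs[j][c]
            # Cross-segment term gamma^(i+1) * q_i S_prev.
            for c in range(d_v):
                row[c] += gamma ** (i + 1) * sum(qs[i][a] * state[a][c] for a in range(d_k))
            out.append(row)
        # S_new = gamma^b S_prev + sum_j gamma^(b-1-j) k_j^T v_j
        state = [[gamma ** b * state[a][c]
                  + sum(gamma ** (b - 1 - j) * ks[j][a] * vs[j][c] for j in range(b))
                  for c in range(d_v)] for a in range(d_k)]
    return out


def multi_scale_retention(q: Matrix, k: Matrix, v: Matrix,
                          gammas: Sequence[float], seg_len: int) -> List[Matrix]:
    """One output per head, each with its own decay rate (q/k/v shared only for brevity)."""
    return [segmented_retention(q, k, v, g, seg_len) for g in gammas]
```

With `seg_len` equal to the sequence length this reduces to the fully parallel form, and with `seg_len = 1` to the pure recurrence. Intermediate values trade parallelism against a cost that is quadratic only in `seg_len`, while the state carried between segments has a fixed size of d_k × d_v regardless of how long the trajectory is.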

Paper Structure

This paper contains 25 sections, 24 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: The overall MaskRDT architecture. On the left, states, actions and RTGs (returns-to-go) are transformed by linear embeddings, and an absolute positional embedding is added. The resulting trajectory segment, after masking according to a predefined configuration, is fed into the first retention block. The middle of the figure shows the retention mechanism, where the masked trajectory is partitioned into $S$ sub-segments: computation within each segment is parallel, while recurrent retention computations connect the segments. On the right, a causal layer follows the $L$-th block and produces two distinct representations that are fed into separate prediction layers. At the top of the architecture are two networks: $N_e$ for reward estimation and $N_g$ for action prediction.
  • Figure 2: Overall comparison results (with variance) between the baselines and CDT4Rec in the VirtualTaobao simulation environment.
  • Figure 3: Performance Comparison Between MaskRDT and CDT4Rec for Different Context Lengths
  • Figure 4: Performance Comparison Between MaskRDT, CDT4Rec and DT for Different Numbers of Trajectories
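Figure 1 starts from embedded states, actions and RTGs. The following sketch shows how returns-to-go could be computed from per-step rewards and interleaved with states and actions before the embedding layer. It assumes the Decision Transformer ordering (RTG, state, action) and a timestep index shared by the three tokens for the absolute positional embedding; the helper names and the optional discount `gamma` are hypothetical, and the learned linear embeddings themselves are omitted.

```python
from typing import Any, List, Sequence, Tuple

Token = Tuple[str, int, Any]  # (modality, timestep, raw value before embedding)


def returns_to_go(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """R_t = sum_{t' >= t} gamma^(t' - t) r_{t'}; gamma = 1.0 is the undiscounted DT form."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def interleave(states: Sequence[Any], actions: Sequence[Any],
               rewards: Sequence[float]) -> List[Token]:
    """Build the token sequence (R_0, s_0, a_0, R_1, s_1, a_1, ...)."""
    if not len(states) == len(actions) == len(rewards):
        raise ValueError("states, actions and rewards must have equal length")
    tokens: List[Token] = []
    for t, (rtg, s, a) in enumerate(zip(returns_to_go(rewards), states, actions)):
        # All three tokens share timestep t, which would index the absolute
        # positional embedding added after the (omitted) linear embeddings.
        tokens.extend([("rtg", t, rtg), ("state", t, s), ("action", t, a)])
    return tokens


print(returns_to_go([1, 0, 2]))  # [3.0, 2.0, 2.0]
```

At inference time, a Decision-Transformer-style model is usually conditioned on a desired initial return, and the RTG token is decremented by each observed reward; the masking and retention stages of Figure 1 then operate on this interleaved sequence.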