Robust Reinforcement Learning Objectives for Sequential Recommender Systems

Melissa Mozifian, Tristan Sylvain, Dave Evans, Lili Meng

TL;DR

This work addresses the instability of reinforcement-learning-based sequential recommendation by integrating a transformer encoder with two stability-enhancing components: conservative Q-learning and contrastive learning using temporal augmentations. The proposed SASRec-CCQL framework, and its simpler variant SASRec-CO, demonstrate state-of-the-art performance across multiple real-world datasets while maintaining training robustness against negative sampling and distributional shift. Key contributions include a formal RL formulation for recommendations, a conservative Q-learning objective tailored to missing or negative actions, and a batch-wise contrastive objective that strengthens representation learning. The findings suggest that offline RL with these stability mechanisms can meaningfully improve long-horizon personalization in recommender systems, with practical implications for deploying RL-based RS in real-world settings.

Abstract

Attention-based sequential recommendation methods have shown promise in accurately capturing users' evolving interests from their past interactions. Beyond generating better user representations, recent research has also explored integrating reinforcement learning (RL) into these models. By framing sequential recommendation as an RL problem with reward signals, we can develop recommender systems that incorporate direct user feedback in the form of rewards, enhancing personalization. Nonetheless, employing RL algorithms presents challenges, including off-policy training, expansive combinatorial action spaces, and the scarcity of datasets with sufficient reward signals. Contemporary approaches have attempted to combine RL and sequential modeling, incorporating contrastive-based objectives and negative sampling strategies for training the RL component. In this work, we further emphasize the efficacy of contrastive-based objectives paired with augmentation for datasets with extended horizons. We also identify instability that can arise when negative sampling is applied. This instability stems primarily from the data imbalance prevalent in real-world datasets, a common issue in offline RL. We introduce an enhanced methodology that addresses these challenges more effectively. Experimental results across several real datasets show that our method achieves increased robustness and state-of-the-art performance.
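
The paper's own equations are not reproduced in this summary, so the following is only a minimal sketch of the two kinds of stabilizing objective named above. It assumes the standard conservative Q-learning regularizer (log-sum-exp of Q-values minus the Q-value of the logged action) and an InfoNCE-style batch-wise contrastive loss. The function names, the temperature, and the weights `alpha` and `beta` are hypothetical and may differ from the authors' formulation.

```python
import numpy as np


def logsumexp(x, axis=-1):
    """Numerically stable log(sum(exp(x))) along `axis`."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def cql_penalty(q_values, actions):
    """Standard CQL regularizer (an assumption, not the paper's exact loss).

    q_values: (batch, num_items) Q-value for every candidate item.
    actions:  (batch,) integer index of the logged (observed) item.
    Pushes down Q-values over all items while pushing up the logged item.
    """
    logged_q = q_values[np.arange(len(actions)), actions]
    return float(np.mean(logsumexp(q_values, axis=1) - logged_q))


def info_nce(anchor, positive, temperature=0.1):
    """Batch-wise InfoNCE loss (assumed form of the contrastive objective).

    Row i of `positive` is the positive view for row i of `anchor`;
    every other row in the batch acts as a negative.
    """
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (batch, batch) cosine similarities
    return float(np.mean(logsumexp(logits, axis=1) - np.diag(logits)))


# Toy usage with random data; all shapes and weights are illustrative.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 10))            # Q-values for 4 states, 10 items
logged = np.array([1, 3, 5, 7])         # items the users actually chose
states = rng.normal(size=(4, 8))        # sequence encodings
augmented = states + 0.01 * rng.normal(size=(4, 8))  # stand-in augmentation

alpha, beta = 1.0, 1.0                  # hypothetical loss weights
aux_loss = alpha * cql_penalty(q, logged) + beta * info_nce(states, augmented)
```

In a full training loop, a term like `aux_loss` would typically be added to the temporal-difference loss and the supervised next-item loss. How the paper weights and combines these terms is not stated in this summary.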

Paper Structure (21 sections, 4 equations, 15 figures, 12 tables)


Figures (15)

  • Figure 1: Enhanced stability and performance on RetailRocket purchase prediction with SASRec-CCQL, an approach that combines contrastive learning and RL-based objectives.
  • Figure 2: Model architecture, showing the training process and the interaction between the transformer model and Q-learning with the proposed objectives. The conservative Q-learning (CQL) objective considers positive samples (green) and hard negative action sampling (red), while the contrastive objective is applied batch-wise across different user items (green vs. orange). For more details, refer to the method section.
  • Figure 3: Our method SASRec-CCQL outperforms other approaches in predicting purchases for both Top-20 and Top-5 recommendations.
  • Figure 4: Purchase prediction comparisons on Top-20 for varying numbers of negative samples. Our method achieves higher performance and remains stable as the number of negative samples increases, unlike the baseline methods SNQN and SA2C, which exhibit performance decline and divergence.
  • Figure 5: Purchase prediction comparisons on Top-5 for varying numbers of negative samples. As the rate of negative samples during training increases, the SNQN baseline drops in performance, while SA2C and SA2C with smoothing (i.e., with off-policy correction enabled) diverge.
  • ...and 10 more figures