Robust Reinforcement Learning Objectives for Sequential Recommender Systems
Melissa Mozifian, Tristan Sylvain, Dave Evans, Lili Meng
TL;DR
This work addresses the instability of reinforcement-learning-based sequential recommendation by combining a transformer encoder with two stability-enhancing components: conservative Q-learning and contrastive learning with temporal augmentations. The proposed SASRec-CCQL framework and its simpler variant, SASRec-CO, demonstrate state-of-the-art performance across multiple real-world datasets while keeping training robust to negative sampling and distributional shift. Key contributions include a formal RL formulation of recommendation, a conservative Q-learning objective tailored to missing or negative actions, and a batch-wise contrastive objective that strengthens representation learning. The findings suggest that offline RL with these stability mechanisms can meaningfully improve long-horizon personalization, with practical implications for deploying RL-based recommender systems in real-world settings.
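To make the conservative Q-learning idea concrete, the following is a minimal sketch of a CQL-style penalty for a discrete action space (one action per candidate item). It is not the paper's exact objective, which this summary does not state: the function names, the `alpha` weight, and the squared TD term are assumptions chosen for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cql_penalty(q_values, logged_action):
    """Conservative term: soft maximum of Q over all actions minus Q of the logged action.

    Minimizing it pushes down Q-values of actions absent from the logged data
    and pushes up the Q-value of the action the user actually took.
    """
    return logsumexp(q_values) - q_values[logged_action]

def conservative_q_loss(q_values, logged_action, td_target, alpha=1.0):
    """Squared TD error plus the alpha-weighted conservative penalty (illustrative)."""
    td_error = q_values[logged_action] - td_target
    return 0.5 * td_error ** 2 + alpha * cql_penalty(q_values, logged_action)
```

The penalty is never negative, and it shrinks as the logged action's Q-value dominates the others, which is why it discourages overestimating items that never appear in the interaction logs.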
Abstract
Attention-based sequential recommendation methods have shown promise in accurately capturing users' evolving interests from their past interactions. Recent research has also explored integrating reinforcement learning (RL) into these models, in addition to learning better user representations. By framing sequential recommendation as an RL problem with reward signals, we can develop recommender systems that incorporate direct user feedback in the form of rewards, enhancing personalization. Nonetheless, employing RL algorithms presents challenges, including off-policy training, expansive combinatorial action spaces, and the scarcity of datasets with sufficient reward signals. Contemporary approaches have attempted to combine RL and sequential modeling, incorporating contrastive objectives and negative sampling strategies for training the RL component. In this work, we further emphasize the efficacy of contrastive objectives paired with augmentation for datasets with extended horizons. We also identify instability that can arise when negative sampling is applied. This instability stems primarily from the data imbalance prevalent in real-world datasets, a common issue in offline RL. To address these challenges, we introduce an enhanced methodology. Experiments on several real-world datasets show that our method is more robust and achieves state-of-the-art performance.
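As a rough illustration of a contrastive objective paired with augmentation, the sketch below crops two temporal views of an interaction sequence and scores a batch of view embeddings with an InfoNCE-style loss. The specific augmentation (a contiguous crop), the cosine similarity, the `temperature` value, and all function names are assumptions; the abstract does not specify them, and the sequence encoder is left out entirely.

```python
import math
import random

def temporal_crop(sequence, ratio, rng):
    """Keep a random contiguous window covering about `ratio` of the sequence.

    Assumes a non-empty sequence and 0 < ratio <= 1.
    """
    length = max(1, int(len(sequence) * ratio))
    start = rng.randint(0, len(sequence) - length)
    return sequence[start:start + length]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(view_a, view_b, temperature=0.5):
    """Batch-wise InfoNCE: view_a[i] should match view_b[i] against every view_b[j]."""
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # negative log-softmax of the positive pair
    return total / len(view_a)
```

In use, each user sequence in a batch would be cropped twice, both crops passed through the (hypothetical) encoder, and the two resulting lists of embeddings given to `info_nce`; the other sequences in the batch serve as negatives, so no separate negative sampler is needed for this term.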
