Towards Off-Policy Reinforcement Learning for Ranking Policies with Human Feedback

Teng Xiao, Suhang Wang

TL;DR

This paper addresses the maximization of long-term rewards in ranking-based recommender systems. It proposes an off-policy value ranking (VR) algorithm grounded in an Expectation-Maximization (EM) framework that unifies probabilistic ranking (MLE) with reinforcement learning. It introduces reward extrapolation and ranking regularization to handle sparse, partial rewards, and extends the EM approach to sequential decision making, yielding a practical VR algorithm that learns without online interactions. Theoretical results show that VR reduces overestimation bias and bounds the variance from importance sampling. Offline and online experiments across multiple backbones and datasets show VR outperforming MLE and existing off-policy methods in both ranking accuracy and long-term reward optimization, including in simulated evaluation environments such as RecSim.

Abstract

Probabilistic learning to rank (LTR) has been the dominant approach for optimizing the ranking metric, but it cannot maximize long-term rewards. Reinforcement learning models have been proposed to maximize users' long-term rewards by formulating recommendation as a sequential decision-making problem, but they achieve inferior accuracy compared to their LTR counterparts, primarily due to the lack of online interactions and the characteristics of ranking. In this paper, we propose a new off-policy value ranking (VR) algorithm that simultaneously maximizes users' long-term rewards and optimizes the ranking metric offline, for improved sample efficiency, in a unified Expectation-Maximization (EM) framework. We show theoretically and empirically that the EM process guides the learned policy to benefit from integrating the future reward with the ranking metric, and to learn without any online interactions. Extensive offline and online experiments demonstrate the effectiveness of our methods.

Paper Structure (16 sections, 23 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Training curves and overestimation bias of MLE, our VR, DQN, and two variants of DQN.
  • Figure 2: Training curves on the multi-objective setting.
  • Figure 3: The overestimation bias on two datasets.
  • Figure 4: Performance with various discount factor $\gamma$.
  • Figure 5: Performance (NDCG@5) with various $\alpha$ and $\beta$.
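The figure captions refer to two quantities, the discount factor $\gamma$ and NDCG@5, that are not defined on this page. The following is a minimal sketch of their textbook definitions, not the authors' evaluation code. The function names are hypothetical, and the linear-gain form of DCG is an assumption; the paper may use a different gain or normalization.

```python
import math


def discounted_return(rewards, gamma):
    """Textbook discounted return: sum over t of gamma**t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def dcg_at_k(relevances, k):
    """DCG@k with linear gain (an assumption): rel_i / log2(i + 2), i from 0."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given ordering divided by DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Relevance labels of the items in the order a ranker returned them.
    ranked_relevances = [0, 1, 0, 0, 1]
    print(ndcg_at_k(ranked_relevances, 5))
    # Per-step rewards of one user session under a given gamma.
    print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))
```

A smaller $\gamma$ weights immediate feedback more heavily, while values near 1 emphasize long-term reward; NDCG@5 only scores the top five positions of the ranked list.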