Proximal Ranking Policy Optimization for Practical Safety in Counterfactual Learning to Rank

Shashank Gupta, Harrie Oosterhuis, Maarten de Rijke

TL;DR

This work proposes a novel approach, proximal ranking policy optimization (PRPO), that provides safety in deployment without assumptions about user behavior, and is the first method with unconditional safety in deployment that translates to robust safety for real-world applications.

Abstract

Counterfactual learning to rank (CLTR) can be risky and, in various circumstances, can produce sub-optimal models that hurt performance when deployed. Safe CLTR was introduced to mitigate these risks when using inverse propensity scoring to correct for position bias. However, the existing safety measure for CLTR is not applicable to state-of-the-art CLTR methods, cannot handle trust bias, and relies on specific assumptions about user behavior. We propose a novel approach, proximal ranking policy optimization (PRPO), that provides safety in deployment without assumptions about user behavior. PRPO removes incentives for learning ranking behavior that is too dissimilar to a safe ranking model. Thereby, PRPO imposes a limit on how much learned models can degrade performance metrics, without relying on any specific user assumptions. Our experiments show that PRPO provides higher performance than the existing safe inverse propensity scoring approach. PRPO always maintains safety, even in maximally adversarial situations. By avoiding assumptions, PRPO is the first method with unconditional safety in deployment that translates to robust safety for real-world applications.

Paper Structure

This paper contains 6 sections, 1 theorem, 14 equations, and 3 figures.

Key Result

Theorem B.1

Let $q$ be a query, $\omega$ be metric weights, $y_0$ be a logging policy ranking, and $y^*(\epsilon_{-},\epsilon_{+})$ be the ranking that optimizes the PRPO objective in Eq. eq:prpo_obj. Assume that $\forall d \in \mathcal{D},\ r(d \mid q) \neq 0$. Then, for any $\Delta \in \mathbb{R}_{\geq 0}$, t

Figures (3)

  • Figure 1: Clipped weight ratios of PRPO objective, as documents are moved from four different original ranks. Left: positive relevance, $r=1$; right: negative relevance, $r=-1$; x-axis: new rank for document; y-axis: unclipped weight ratios (dashed lines), $r\cdot\omega_i(d)/\omega_{i,0}(d)$; and clipped PRPO weight ratios (solid lines), $f\left(\omega_i(d)/\omega_{i,0}(d), \epsilon_{-} = 1.15^{-1}, \epsilon_{+}= 1.15, r=\pm1\right)$. DCG metric weights used: $\omega_i(d) = \log_2(\text{rank}(d \mid q_i, \pi) + 1)^{-1}$.
  • Figure 2: NDCG@5 performance of IPS, DR, safe DR ($\delta=0.95$) and PRPO ($\delta(N)=\frac{100}{N}$), with the number of simulated queries in the training data ($N$) varying from $10^2$ to $10^9$.
  • Figure 3: Performance of safe DR and PRPO under an adversarial click model for varying data sizes.
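
The quantities in the Figure 1 caption can be sketched numerically. The DCG weight formula and the clipping bounds ($\epsilon_{-} = 1.15^{-1}$, $\epsilon_{+} = 1.15$) are taken from the caption. The exact form of $f$ is not given in this summary, so the PPO-style clipping below is an assumption: the ratio is capped at $\epsilon_{+}$ when $r \geq 0$ and floored at $\epsilon_{-}$ when $r < 0$, then multiplied by $r$. The function names `dcg_weight` and `clipped_ratio` are hypothetical.

```python
import math


def dcg_weight(rank: int) -> float:
    """DCG metric weight from the Figure 1 caption: 1 / log2(rank + 1)."""
    return 1.0 / math.log2(rank + 1)


def clipped_ratio(ratio: float, eps_minus: float, eps_plus: float, r: float) -> float:
    """ASSUMED form of f: a PPO-style clip of the weight ratio.

    For r >= 0 the ratio is capped at eps_plus, so moving a relevant document
    further up stops paying off once the cap is reached. For r < 0 the ratio
    is floored at eps_minus, so pushing a non-relevant document further down
    stops paying off as well. Moves in the harmful direction stay unclipped.
    """
    if r >= 0:
        return r * min(ratio, eps_plus)
    return r * max(ratio, eps_minus)


# Clipping bounds as stated in the Figure 1 caption.
EPS_PLUS = 1.15
EPS_MINUS = 1.0 / 1.15

# Move a document from an original rank (illustrative choice) to ranks 1..8.
original_rank = 4
for new_rank in range(1, 9):
    ratio = dcg_weight(new_rank) / dcg_weight(original_rank)
    pos = clipped_ratio(ratio, EPS_MINUS, EPS_PLUS, r=1.0)
    neg = clipped_ratio(ratio, EPS_MINUS, EPS_PLUS, r=-1.0)
    print(f"new rank {new_rank}: ratio={ratio:.3f}  f(r=+1)={pos:.3f}  f(r=-1)={neg:.3f}")
```

Under this assumed form, the clipped value is flat outside the interval $[\epsilon_{-}, \epsilon_{+}]$ in the beneficial direction, which corresponds to the abstract's statement that PRPO removes incentives for ranking behavior too dissimilar to the safe model.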

Theorems & Definitions (1)

  • Theorem B.1