Practical and Robust Safety Guarantees for Advanced Counterfactual Learning to Rank

Shashank Gupta; Harrie Oosterhuis; Maarten de Rijke

TL;DR

This work tackles the safety of counterfactual learning to rank (CLTR) in production, showing that existing safe methods relying on inverse propensity scoring (IPS) and simple position-bias assumptions are insufficient for modern CLTR. It generalizes safe CLTR to work with state-of-the-art doubly robust (DR) estimators and trust bias, and introduces Proximal Ranking Policy Optimization (PRPO), an unconditional safety mechanism that does not rely on any user-model assumptions. The authors prove a DR-based safety bound, and demonstrate in semi-synthetic experiments that safe DR and PRPO outperform the prior safe IPS approach, with PRPO offering robust safety even in maximally adversarial settings. Net effect: a generalized safety framework for advanced CLTR and a practical, assumption-free method for deploying safe LTR in real-world applications.

Abstract

Counterfactual learning to rank (CLTR) can be risky and, in various circumstances, can produce sub-optimal models that hurt performance when deployed. Safe CLTR was introduced to mitigate these risks when using inverse propensity scoring to correct for position bias. However, the existing safety measure for CLTR is not applicable to state-of-the-art CLTR methods, cannot handle trust bias, and relies on specific assumptions about user behavior. Our contributions are two-fold. First, we generalize the existing safe CLTR approach to make it applicable to state-of-the-art doubly robust CLTR and trust bias. Second, we propose a novel approach, proximal ranking policy optimization (PRPO), that provides safety in deployment without assumptions about user behavior. PRPO removes incentives for learning ranking behavior that is too dissimilar to a safe ranking model. Thereby, PRPO imposes a limit on how much learned models can degrade performance metrics, without relying on any specific user assumptions. Our experiments show that both our novel safe doubly robust method and PRPO provide higher performance than the existing safe inverse propensity scoring approach. However, in unexpected circumstances, the safe doubly robust approach can become unsafe and bring detrimental performance. In contrast, PRPO always maintains safety, even in maximally adversarial situations. By avoiding assumptions, PRPO is the first method with unconditional safety in deployment that translates to robust safety for real-world applications.

Paper Structure (17 sections, 3 theorems, 30 equations, 4 figures)

Introduction
Related Work
Background
Learning to rank
Assumptions about user click behavior
Counterfactual learning to rank
Safety in counterfactual learning to rank
Proximal policy optimization
Extending Safety to Advanced CLTR
Method: Safe doubly-robust CLTR
Method: Proximal Ranking Policy Optimization (PRPO)
Experimental Setup
Results and Discussion
Conclusion
Appendix: Extended Safety Proof
...and 2 more sections

Key Result

Theorem 4.1

Given the true utility $U(\pi)$ (Eq. true-utility) and its exposure-based DR estimate $\hat{U}_{\text{DR}}(\pi)$ (Eq. cltr-obj-dr) of the ranking policy $\pi$ with the logging policy $\pi_{0}$ and the metric weights $\omega$ and $\omega_{0}$ (Eq. eq:omega and eq:omega_logging), assuming the trust bi

Figures (4)

Figure 1: Weight ratios in the clipped PRPO objective (solid lines) and their unclipped counterparts (dashed lines), as documents are moved from four different original ranks. Left: positive relevance, $r=1$; right: negative relevance, $r=-1$. x-axis: new rank of the document; y-axis: unclipped weight ratios (dashed lines), $r\cdot\omega_i(d)/\omega_{i,0}(d)$, and clipped PRPO weight ratios (solid lines), $f\left(\omega_i(d)/\omega_{i,0}(d), \epsilon_{-} = 1.15^{-1}, \epsilon_{+}= 1.15, r=\pm1\right)$. DCG metric weights used: $\omega_i(d) = \log_2(\mathrm{rank}(d \mid q_i, \pi) + 1)^{-1}$.
Figure 2: Performance in terms of NDCG@5 of the IPS and DR methods and of the proposed safe DR ($\delta=0.95$) and PRPO ($\delta(N)=\frac{100}{N}$) methods for CLTR. Results are shown for varying training-data sizes ($N$), with the number of simulated queries ranging from $10^2$ to $10^9$. Results are averaged over 10 runs; the shaded areas indicate $80\%$ prediction intervals.
Figure 3: Performance of the safe DR and PRPO methods with varying safety parameter ($\delta$). Top row: sensitivity analysis of PRPO with varying clipping parameter ($\delta$) over varying dataset sizes $N$. Bottom row: sensitivity analysis of the safe DR method with varying safety confidence parameter ($\delta$). Results are averaged over 10 runs; the shaded areas indicate $80\%$ prediction intervals.
Figure 4: Performance of the proposed safe DR and PRPO methods under the adversarial click model. Top: sensitivity analysis of the PRPO method with varying clipping parameter ($\delta$). Bottom: sensitivity analysis of the safe DR method with varying safety confidence parameter ($\delta$). Results are averaged over 10 independent runs; the shaded areas indicate $80\%$ prediction intervals.
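
The clipping behavior described in the Figure 1 caption can be illustrated with a small sketch. The DCG weights and the values $\epsilon_{-} = 1.15^{-1}$, $\epsilon_{+} = 1.15$ come from the caption; the exact form of $f$ is not given in this excerpt, so the PPO-style one-sided clipping below (cap the ratio from above for non-negative relevance, from below for negative relevance) is an assumption, and the function names `dcg_weight`, `clip_ratio`, and `clipped_weight_ratio` are hypothetical.

```python
import math

# Clipping bounds taken from the Figure 1 caption.
EPS_PLUS = 1.15
EPS_MINUS = 1 / 1.15


def dcg_weight(rank: int) -> float:
    """DCG metric weight 1 / log2(rank + 1), as in the Figure 1 caption."""
    return 1.0 / math.log2(rank + 1)


def clip_ratio(ratio: float, eps_minus: float, eps_plus: float, r: float) -> float:
    """ASSUMED form of f: PPO-style one-sided clipping.

    For non-negative relevance the ratio is capped at eps_plus, so moving a
    relevant document further up yields no extra reward beyond the cap.
    For negative relevance the ratio is floored at eps_minus, so moving a
    non-relevant document further down yields no extra reward either.
    """
    if r >= 0:
        return min(ratio, eps_plus)
    return max(ratio, eps_minus)


def clipped_weight_ratio(original_rank: int, new_rank: int, r: float) -> float:
    """Clipped ratio of the new DCG weight to the weight at the original rank."""
    ratio = dcg_weight(new_rank) / dcg_weight(original_rank)
    return clip_ratio(ratio, EPS_MINUS, EPS_PLUS, r)


# Print the clipped ratios for a document that starts at rank 4.
for new_rank in range(1, 9):
    pos = clipped_weight_ratio(4, new_rank, r=1)
    neg = clipped_weight_ratio(4, new_rank, r=-1)
    print(f"new rank {new_rank}: r=+1 -> {pos:.3f}, r=-1 -> {neg:.3f}")
```

Under this assumed form, the clipped ratio is flat once a document has moved far enough from its original rank in the rewarded direction, which is one way to read the abstract's statement that PRPO removes incentives for ranking behavior that is too dissimilar to the safe model.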

Theorems & Definitions (3)

Theorem 4.1
Theorem 5.1
Lemma A.1
