Pessimistic Off-Policy Optimization for Learning to Rank

Matej Cief; Branislav Kveton; Michal Kompan

TL;DR

This paper studies pessimistic off-policy optimization for learning to rank: it computes lower confidence bounds on the parameters of click models and returns the list with the highest pessimistic estimate of its value.

Abstract

Off-policy learning is a framework for optimizing policies without deploying them, using data collected by another policy. In recommender systems, this is especially challenging due to the imbalance in logged data: some items are recommended, and thus logged, more frequently than others. The imbalance is exacerbated when recommending a list of items, because the action space is combinatorial. To address this challenge, we study pessimistic off-policy optimization for learning to rank. The key idea is to compute lower confidence bounds on the parameters of click models and then return the list with the highest pessimistic estimate of its value. The approach is computationally efficient, and we analyze the error of acting pessimistically. We study its Bayesian and frequentist variants, and overcome the limitation of an unknown prior by incorporating empirical Bayes. To show the empirical effectiveness of our approach, we compare it to off-policy optimizers that use inverse propensity scores or neglect uncertainty. Our approach outperforms all baselines and is both robust and general.
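The key idea can be illustrated with a minimal sketch. It assumes a cascade-style click model, where the value of a list is the probability of at least one click, and a Hoeffding-style frequentist lower confidence bound. The function names, the logged counts, and the exact form of the bound are illustrative assumptions, not the paper's implementation.

```python
import math

def hoeffding_lcb(clicks, views, delta=0.05):
    """Hoeffding-style lower confidence bound on an attraction probability.

    Assumption: this bound stands in for the paper's frequentist bound.
    """
    if views == 0:
        return 0.0  # never-logged item gets the most pessimistic value
    mean = clicks / views
    width = math.sqrt(math.log(1.0 / delta) / (2.0 * views))
    return max(0.0, mean - width)

def pessimistic_list(stats, k, delta=0.05):
    """Return the k items with the highest lower confidence bounds, best first."""
    lcbs = {item: hoeffding_lcb(c, n, delta) for item, (c, n) in stats.items()}
    return sorted(lcbs, key=lcbs.get, reverse=True)[:k]

def cascade_value(attraction, ranked):
    """Cascade-model value of a list: probability of at least one click."""
    no_click = 1.0
    for item in ranked:
        no_click *= 1.0 - attraction[item]
    return 1.0 - no_click

# Hypothetical logged (clicks, impressions) per item.
stats = {"a": (1, 1), "b": (60, 100), "c": (450, 1000), "d": (0, 0)}

# Ranking by the empirical mean (MLE) trusts item "a" after a single impression.
mle_top = sorted(stats, key=lambda i: stats[i][0] / max(stats[i][1], 1), reverse=True)[:2]

# Ranking by the lower confidence bound prefers well-explored items.
lcb_top = pessimistic_list(stats, k=2)
```

In this toy log, item "a" has a perfect click rate from one impression, so the MLE ranks it first, while the pessimistic ranking discounts it and selects the two items with ample evidence. Because the cascade value is monotone in each attraction probability, picking the top-k items by lower bound maximizes the pessimistic list value in this sketch.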

Paper Structure (42 sections, 4 theorems, 28 equations, 10 figures, 2 tables, 1 algorithm)

Related Work
Off-Policy Optimization
Counterfactual Learning to Rank
Pessimistic Off-Policy Optimization
Setting
Pessimistic Optimization
Structured Pessimism
Cascade Model
Dependent-Click Model
Position-Based Model
Lower Confidence Bounds on Attraction Probabilities
Bayesian Lower Confidence Bounds
Frequentist Lower Confidence Bounds
Prior Estimation
Analysis
...and 27 more sections

Key Result

Lemma 1

Let Then for any item $a \in \mathcal{E}$ and context $X \in \mathcal{X}$, $\left|\hat{\theta}_{a, X} - \theta_{a, X}\right| \leq c(a, X)$ holds with probability at least $1 - \delta$.
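A concentration statement of this form can be checked numerically. The definition of $c(a, X)$ is not shown in this summary, so the sketch below substitutes a two-sided Hoeffding width, $\sqrt{\log(2/\delta) / (2n)}$ for $n$ observations of a single item; that choice, the function names, and the simulation parameters are assumptions for illustration only.

```python
import math
import random

def hoeffding_width(n, delta):
    """Two-sided Hoeffding width; an assumed stand-in for c(a, X)."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def empirical_coverage(theta, n, delta, trials=2000, seed=0):
    """Fraction of simulated logs in which |theta_hat - theta| <= width."""
    rng = random.Random(seed)
    width = hoeffding_width(n, delta)
    hits = 0
    for _ in range(trials):
        # Simulate n Bernoulli(theta) click observations for one item.
        clicks = sum(rng.random() < theta for _ in range(n))
        if abs(clicks / n - theta) <= width:
            hits += 1
    return hits / trials

# Hypothetical item with attraction probability 0.3 and 200 logged impressions.
coverage = empirical_coverage(theta=0.3, n=200, delta=0.1)
```

Hoeffding's inequality guarantees coverage of at least $1 - \delta$, so the simulated fraction should sit at or above 0.9 here; in practice the bound is conservative and the observed coverage is typically higher.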

Figures (10)

Figure 1: Comparison of our methods to baselines on three click models and top 10 queries. We vary the parameter $\delta$ that sets the confidence interval width. MLE is the grey dashed line.
Figure 2: Comparison of our methods to MLE when increasing the sample size $n$.
Figure 3: Robustness evaluation of our methods and baselines.
Figure 4: Comparison of our methods to baselines on the Yahoo! Webscope dataset.
Figure 5: Comparison of our methods to baselines on the MSLR-WEB30k dataset.
...and 5 more figures

Theorems & Definitions (6)

Lemma 1: Concentration of item estimates
Lemma 2: Concentration of list estimates
Theorem 3: Error of acting pessimistically
Lemma 4
proof
proof
