An Offline Metric for the Debiasedness of Click Models

Romain Deffayet, Philipp Hager, Jean-Michel Renders, Maarten de Rijke

TL;DR

This paper addresses the failure of traditional click-model evaluations to generalize under covariate shift by introducing the notion of debiasedness and a practical metric, CMIP, based on conditional mutual information. CMIP quantifies how much a trained click model’s relevance scores correlate with the logging policy beyond the true relevance signal, and is estimated via KL-divergence between a joint distribution and a constructed marginal where dependence on the logging policy is removed. Through semi-synthetic experiments on real data, CMIP demonstrates predictive power for downstream out-of-distribution performance and reduces regret in off-policy model selection when combined with existing metrics like PPL and nDCG. The work provides a principled, distribution-aware offline evaluation framework and releases code to enable robust model selection for downstream tasks such as counterfactual learning-to-rank and fair ranking.
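
To make the quantity behind CMIP concrete, here is a minimal sketch of conditional mutual information, I(X; Y | Z), for discrete samples. It is a plain plug-in (counting) estimator, not the paper's estimator, which works on continuous relevance scores via a KL-divergence. The function name `conditional_mutual_information` and the toy data are assumptions made for illustration: X stands for the click model's relevance scores, Y for the logging policy's scores, and Z for the true relevance.

```python
import math
from collections import Counter


def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X; Y | Z) in nats for discrete samples.

    Hypothetical helper for illustration; not the paper's CMIP estimator.
    """
    n = len(xs)
    if n == 0 or not (n == len(ys) == len(zs)):
        raise ValueError("inputs must be non-empty and of equal length")
    c_xyz = Counter(zip(xs, ys, zs))
    c_xz = Counter(zip(xs, zs))
    c_yz = Counter(zip(ys, zs))
    c_z = Counter(zs)
    cmi = 0.0
    for (x, y, z), n_xyz in c_xyz.items():
        # p(x,y,z) * log( p(z) p(x,y,z) / (p(x,z) p(y,z)) );
        # the 1/n normalisers cancel inside the ratio.
        ratio = (c_z[z] * n_xyz) / (c_xz[(x, z)] * c_yz[(y, z)])
        cmi += (n_xyz / n) * math.log(ratio)
    return cmi


# Toy data (assumed): binary true relevance and a logging policy whose
# scores are unrelated to relevance.
true_rel = [0, 0, 0, 0, 1, 1, 1, 1]
policy = [0, 0, 1, 1, 0, 0, 1, 1]

model_debiased = list(true_rel)  # scores depend only on true relevance
model_biased = list(policy)      # scores copy the logging policy

cmi_debiased = conditional_mutual_information(model_debiased, policy, true_rel)
cmi_biased = conditional_mutual_information(model_biased, policy, true_rel)
print(cmi_debiased, cmi_biased)  # 0.0 and log(2), about 0.693
```

In this toy setup the model that only tracks true relevance scores zero, while the model that copies the logging policy retains dependence on it even after conditioning on true relevance.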

Abstract

A well-known problem when learning from user clicks is the presence of inherent biases in the data, such as position or trust bias. Click models are a common method for extracting information from user clicks, such as document relevance in web search, or for estimating click biases for downstream applications such as counterfactual learning-to-rank, ad placement, or fair ranking. Recent work shows that the current evaluation practices in the community fail to guarantee that a well-performing click model generalizes well to downstream tasks in which the ranking distribution differs from the training distribution, i.e., under covariate shift. In this work, we propose an evaluation metric based on conditional independence testing to detect a lack of robustness to covariate shift in click models. We introduce the concept of debiasedness in click modeling and derive a metric for measuring it. In extensive semi-synthetic experiments, we show that our proposed metric helps to predict the downstream performance of click models under covariate shift and is useful in an off-policy model selection setting.

Paper Structure (37 sections, 3 theorems, 14 equations, 2 figures, 2 tables)

Key Result

theorem 1

A click model that is invariant under policy shift is debiasing. For every dataset $\mathcal{D}$ and ranking $y$:

Figures (2)

  • Figure 1: Comparing the relevance estimates of two click models (DCTR and PBM) against the relevance estimates of an almost optimal logging policy (NoisyOracle, defined in Section \ref{stochastic-policies}) for 1.5k documents, grouped by their true relevance. Clicks follow a PBM user model. The DCTR model achieves a higher nDCG but correlates notably with the logging policy, resulting in a high CMIP. In contrast to the PBM, the DCTR model is not debiased in this setup. Note that CMIP is in theory a non-negative metric but approximations can make it slightly negative.
  • Figure 2: Comparing the performance of click models. Our proposed metric, CMIP, helps predict out-of-distribution results. All models are trained on a PBM user model and a NoisyOracle logging policy and evaluated under a uniform policy. We average results over ten independent runs and we display the 95% confidence interval.
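
The setup in these figures can be illustrated with a small sketch of why a document-level click-through-rate (DCTR) estimate inherits the logging policy's ranking when clicks follow a position-based model (PBM). It uses the standard PBM assumption that the click probability is the examination probability of the rank times the document's relevance. The examination values, relevance values, and the two policies below are made-up numbers, and the helper names are hypothetical; this is not the paper's experimental code.

```python
def pbm_click_prob(examination, relevance):
    """PBM: P(click) = P(rank is examined) * P(document is relevant)."""
    return examination * relevance


def expected_dctr(rank_probs, examination, relevance):
    """Expected click-through rate of one document.

    rank_probs[k] is the probability that the logging policy shows the
    document at rank k; examination[k] is the PBM examination probability.
    """
    return sum(
        p * pbm_click_prob(e, relevance)
        for p, e in zip(rank_probs, examination)
    )


# Assumed toy values: two ranks, two documents; B is truly more relevant.
examination = [1.0, 0.5]
rel_a, rel_b = 0.4, 0.6

# Skewed logging policy: always shows A first and B second.
skewed_a = expected_dctr([1.0, 0.0], examination, rel_a)
skewed_b = expected_dctr([0.0, 1.0], examination, rel_b)

# Uniform logging policy: each document is equally likely at either rank.
uniform_a = expected_dctr([0.5, 0.5], examination, rel_a)
uniform_b = expected_dctr([0.5, 0.5], examination, rel_b)

print(skewed_a > skewed_b)    # True: DCTR follows the logging policy's order
print(uniform_b > uniform_a)  # True: the true relevance order is recovered
```

Under the skewed policy the DCTR estimate prefers A even though B is more relevant, so its scores track the logging policy rather than relevance. A model that separates examination from relevance, as the PBM does, is not subject to this particular confound.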

Theorems & Definitions (6)

  • definition 1
  • definition 2
  • definition 3
  • theorem 1
  • corollary 1
  • corollary 2