An Offline Metric for the Debiasedness of Click Models
Romain Deffayet, Philipp Hager, Jean-Michel Renders, Maarten de Rijke
TL;DR
This paper addresses the failure of traditional click-model evaluation to guarantee generalization under covariate shift. It introduces the notion of debiasedness and a practical metric, CMIP, based on conditional mutual information. CMIP quantifies how strongly a trained click model’s relevance scores depend on the logging policy once the true relevance signal is accounted for; it is estimated as the KL-divergence between the joint distribution and a constructed marginal from which the dependence on the logging policy is removed. In semi-synthetic experiments on real data, CMIP helps predict downstream out-of-distribution performance and reduces regret in off-policy model selection when combined with existing metrics such as PPL and nDCG. The work provides a principled, distribution-aware offline evaluation framework and releases code to enable robust model selection for downstream tasks such as counterfactual learning-to-rank and fair ranking.
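To make the quantity behind CMIP concrete, the following is a minimal sketch of conditional mutual information I(X; Y | Z), where X stands for the click model's relevance scores, Y for the logging policy's scores, and Z for the true relevance. It is not the estimator used in the paper: it assumes all three variables have been discretized and uses a simple plug-in (count-based) estimate, and the function name `conditional_mutual_information` is hypothetical. The value equals the KL-divergence between the joint p(x, y, z) and the product p(x | z) p(y | z) p(z), in which X and Y are independent given Z.

```python
from collections import Counter
from math import log


def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(X; Y | Z) in nats for discrete samples.

    Assumption: x, y, z are equal-length sequences of hashable,
    already-discretized values (e.g. binned relevance scores).
    """
    n = len(x)
    if n == 0 or len(y) != n or len(z) != n:
        raise ValueError("x, y, z must be non-empty and of equal length")

    # Empirical counts for the joint and the marginals we need.
    c_xyz = Counter(zip(x, y, z))
    c_xz = Counter(zip(x, z))
    c_yz = Counter(zip(y, z))
    c_z = Counter(z)

    cmi = 0.0
    for (xi, yi, zi), n_xyz in c_xyz.items():
        # p(x,y,z) * log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ];
        # the 1/n factors inside the log cancel.
        ratio = (n_xyz * c_z[zi]) / (c_xz[(xi, zi)] * c_yz[(yi, zi)])
        cmi += (n_xyz / n) * log(ratio)
    return cmi


# Toy data: true relevance, logging-policy scores, and two model outputs.
true_rel = [0, 0, 1, 1]
policy = [0, 1, 0, 1]
debiased = list(true_rel)   # model scores determined by true relevance only
biased = list(policy)       # model scores that copy the logging policy

cmi_debiased = conditional_mutual_information(debiased, policy, true_rel)
cmi_biased = conditional_mutual_information(biased, policy, true_rel)
```

In this toy setup the model whose scores follow true relevance carries no extra information about the logging policy, while the model that copies the policy does. With continuous scores, a count-based estimate like this one is no longer appropriate and a different estimator would be needed.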
Abstract
A well-known problem when learning from user clicks is the presence of inherent biases in the data, such as position or trust bias. Click models are a common method for extracting information from user clicks, such as document relevance in web search, or for estimating click biases for downstream applications such as counterfactual learning-to-rank, ad placement, or fair ranking. Recent work shows that the current evaluation practices in the community fail to guarantee that a well-performing click model generalizes well to downstream tasks in which the ranking distribution differs from the training distribution, i.e., under covariate shift. In this work, we propose an evaluation metric based on conditional independence testing to detect a lack of robustness to covariate shift in click models. We introduce the concept of debiasedness in click modeling and derive a metric for measuring it. In extensive semi-synthetic experiments, we show that our proposed metric helps to predict the downstream performance of click models under covariate shift and is useful in an off-policy model selection setting.
