Towards Robust Offline Evaluation: A Causal and Information Theoretic Framework for Debiasing Ranking Systems
Seyedeh Baharan Khatami, Sayan Chakraborty, Ruomeng Xu, Babak Salimi
TL;DR
The paper tackles biases in the offline evaluation of retrieval-ranking systems by formulating a causal problem that separates non-relevant bias features $X^{nr}$ from user-preference features $X^r$, with the goal of satisfying $C \perp E \mid X^r$. It introduces a general debiasing framework that perturbs the data by assigning weights over discretized bias groups, so as to minimize $I(E; C \mid X^r)$ while preserving predictive utility. The conditional mutual information is estimated through the Donsker-Varadhan dual representation with a neural estimator $T_\theta$, which is trained together with a predictive model $f_\phi$ via the joint loss $\mathcal{L} = \mathcal{L}_\phi + \lambda I_\theta(E; C \mid X^r)$; Bayesian optimization guides the weight selection. The framework is validated on the public Coat dataset and on internal real-time data, showing reduced dependence of clicks on bias factors and closer alignment between evaluations on biased data and MAR benchmarks, and yielding more reliable ranking evaluation and better training signals. Overall, the approach provides a system-agnostic, scalable path to more accurate, fair, and generalizable offline evaluation prior to online deployment.
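For readers unfamiliar with the Donsker-Varadhan representation, the following is a sketch of how an estimator such as $I_\theta(E; C \mid X^r)$ is typically written. It assumes the standard textbook form of the bound; the distributions $P$ and $Q$ are notation introduced here for illustration, and the summary does not specify how the paper constructs samples from $Q$ or how it handles the sample weights.

$$
I(E; C \mid X^r) = D_{\mathrm{KL}}\big(P \,\|\, Q\big), \qquad P = P_{E, C, X^r}, \quad Q = P_{E \mid X^r}\, P_{C \mid X^r}\, P_{X^r}
$$

$$
D_{\mathrm{KL}}\big(P \,\|\, Q\big) = \sup_{T}\; \mathbb{E}_{P}\big[T\big] - \log \mathbb{E}_{Q}\big[e^{T}\big]
$$

$$
I_\theta(E; C \mid X^r) = \mathbb{E}_{P}\big[T_\theta(E, C, X^r)\big] - \log \mathbb{E}_{Q}\big[e^{T_\theta(E, C, X^r)}\big] \;\le\; I(E; C \mid X^r)
$$

Under this reading, maximizing over $\theta$ tightens the lower bound, while the term $\lambda I_\theta(E; C \mid X^r)$ in the joint loss penalizes any remaining dependence between $E$ and $C$ given $X^r$. The bound equals zero exactly when $C \perp E \mid X^r$ holds, since $P$ and $Q$ then coincide.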
Abstract
Evaluating retrieval-ranking systems is crucial for developing high-performing models. While online A/B testing is the gold standard, its high cost and risks to user experience call for effective offline methods. However, relying on historical interaction data introduces biases (such as selection, exposure, conformity, and position biases) that distort evaluation metrics. These biases stem from the Missing-Not-At-Random (MNAR) nature of user interactions and favor popular or frequently exposed items over true user preferences. We propose a novel framework for robust offline evaluation of retrieval-ranking systems that transforms MNAR data into Missing-At-Random (MAR) data through reweighting combined with black-box optimization, guided by neural estimation of information-theoretic metrics. Our contributions include (1) a causal formulation for addressing offline evaluation biases, (2) a system-agnostic debiasing framework, and (3) empirical validation of its effectiveness. This framework enables more accurate, fair, and generalizable evaluations, enhancing model assessment before deployment.
