Towards Robust Offline Evaluation: A Causal and Information Theoretic Framework for Debiasing Ranking Systems

Seyedeh Baharan Khatami, Sayan Chakraborty, Ruomeng Xu, Babak Salimi

TL;DR

The paper tackles offline evaluation biases in retrieval-ranking systems by formulating a causal problem that isolates non-relevant bias features $X^{nr}$ from user preferences $X^r$, aiming to satisfy $C \perp E \mid X^r$. It introduces a general debiasing framework that perturbs the data with weights over discretized bias groups to minimize $I(E; C \mid X^r)$ while preserving predictive utility. The framework uses the Donsker-Varadhan dual representation of mutual information with a neural estimator $T_\theta$, together with a predictive model $f_\phi$, trained via a joint loss $\mathcal{L} = \mathcal{L}_\phi + \lambda I_\theta(E; C \mid X^r)$; Bayesian optimization guides the weight selection. The framework is validated on the public Coat dataset and on internal real-time data, showing reduced dependence of clicks on bias factors and closer alignment between evaluations on biased data and MAR benchmarks, along with more reliable ranking evaluation and better training signals. Overall, the approach provides a system-agnostic, scalable path to more accurate, fair, and generalizable offline evaluation prior to online deployment.
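As a rough illustration of the Donsker-Varadhan representation and the joint loss mentioned above, the following NumPy sketch evaluates the bound $\mathbb{E}_P[T] - \log \mathbb{E}_Q[e^T]$ for a fixed critic. It is a simplification under stated assumptions, not the paper's implementation: it estimates unconditional mutual information rather than $I(E; C \mid X^r)$, the hand-written `critic` stands in for the learned network $T_\theta$, and the names `dv_lower_bound` and `joint_loss` are hypothetical.

```python
import numpy as np

def dv_lower_bound(t_joint, t_product):
    """Donsker-Varadhan estimate: mean of T under the joint distribution
    minus log-mean-exp of T under the product of marginals."""
    t_joint = np.asarray(t_joint, dtype=float)
    t_product = np.asarray(t_product, dtype=float)
    shift = t_product.max()  # subtract the max for numerical stability
    log_mean_exp = shift + np.log(np.mean(np.exp(t_product - shift)))
    return t_joint.mean() - log_mean_exp

def joint_loss(prediction_loss, mi_estimate, lam):
    """L = L_phi + lambda * I_theta, mirroring the joint loss in the text."""
    return prediction_loss + lam * mi_estimate

def critic(x, y):
    # Hand-written stand-in for the learned critic T_theta (assumption).
    # It approximates log p(x, y) / (p(x) p(y)) for the toy data below.
    return np.where(x == y, np.log(2.0), -50.0)

# Toy data: X = Y is a fair coin, so the true I(X; Y) is log 2 nats.
x_joint, y_joint = np.array([0, 1]), np.array([0, 1])
# Samples from the product of marginals: all four combinations.
x_prod, y_prod = np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])

mi_estimate = dv_lower_bound(critic(x_joint, y_joint), critic(x_prod, y_prod))
total = joint_loss(prediction_loss=0.4, mi_estimate=mi_estimate, lam=0.5)
print(mi_estimate, total)
```

With a near-optimal critic the bound is tight, so the toy estimate recovers $\log 2$. In the framework described above, the critic would instead be a trained network, and the estimate would be taken conditionally on $X^r$.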

Abstract

Evaluating retrieval-ranking systems is crucial for developing high-performing models. While online A/B testing is the gold standard, its high cost and risks to user experience require effective offline methods. However, relying on historical interaction data introduces biases, such as selection, exposure, conformity, and position biases, that distort evaluation metrics. These biases are driven by the Missing-Not-At-Random (MNAR) nature of user interactions and favor popular or frequently exposed items over true user preferences. We propose a novel framework for robust offline evaluation of retrieval-ranking systems, transforming MNAR data into Missing-At-Random (MAR) through reweighting combined with black-box optimization, guided by neural estimation of information-theoretic metrics. Our contributions include (1) a causal formulation for addressing offline evaluation biases, (2) a system-agnostic debiasing framework, and (3) empirical validation of its effectiveness. This framework enables more accurate, fair, and generalizable evaluations, enhancing model assessment before deployment.

Paper Structure

This paper contains 10 sections, 4 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Causal DAG of the random variables: $X^r$ represents relevant user-specific item features, $X^{nr}$ represents non-relevant bias features that influence exposure, $E$ is exposure, and $C$ is click.
  • Figure 2: Framework schema: Continuous bias attributes are bucketized into user-defined bins. The bias attribute is passed to the Bayesian optimization framework, which optimizes the objective function (comprising CMI estimation and click prediction performance) to find the optimal resampling weights, defined over the bias attribute bins.
  • Figure 3: Comparing the impact of debiasing user interactions across the relevance spectrum vs. down-funnel preference signals (e.g., saves).
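
The bucketize-then-resample step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming a single continuous bias attribute and a fixed candidate weight vector; in the framework the weights would be proposed by the Bayesian optimizer, which is not shown here. The function names, bin edges, and the popularity values are hypothetical.

```python
import numpy as np

def bucketize(values, inner_edges):
    """Map a continuous bias attribute to integer bin ids 0..len(inner_edges)."""
    return np.digitize(values, inner_edges)

def resample_indices(bin_ids, bin_weights, n_samples, rng):
    """Draw row indices with replacement, with probability proportional
    to the weight assigned to each row's bin."""
    row_weights = np.asarray(bin_weights, dtype=float)[bin_ids]
    probs = row_weights / row_weights.sum()
    return rng.choice(len(bin_ids), size=n_samples, replace=True, p=probs)

# Hypothetical bias attribute, e.g. an item popularity score in [0, 1].
popularity = np.array([0.05, 0.20, 0.45, 0.50, 0.80, 0.95])
bin_ids = bucketize(popularity, inner_edges=[0.33, 0.66])  # three bins

# Candidate per-bin weights (illustrative values only).
bin_weights = [1.0, 0.0, 2.0]

rng = np.random.default_rng(0)
sampled = resample_indices(bin_ids, bin_weights, n_samples=1000, rng=rng)
print(np.bincount(bin_ids[sampled], minlength=3))
```

Each candidate weight vector yields a resampled dataset on which the objective (CMI estimate plus click prediction performance) can be evaluated. A weight of zero removes a bin entirely, which is used here only to make the effect of the weights easy to see.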