Prioritized Ranking Experimental Design Using Recommender Systems in Two-Sided Platforms
Mahyar Habibi, Zahra Khanalizadeh, Negar Ziaeian
TL;DR
This work tackles interference in item-side experiments on two-sided marketplaces and the bias it induces in standard A/B tests. It introduces Two-Sided Prioritized Ranking (TSPR), a ranking-based experimental design that uses the recommender system as an instrument to estimate the Total Average Treatment Effect ($TATE$) while preserving access to all items and keeping each item's treatment realization consistent across users. The methodology partitions items into Treated, Untreated, and Placebo groups and assigns queries to two strata with distinct ranking priorities. It uses partial outcomes $Y^l$ to aggregate exposure effects across ranks; the estimator combines information across ranks with a pre-experiment baseline, and standard errors are bootstrapped. In validation on semi-synthetic Expedia hotel data, TSPR produces near-ground-truth $TATE$ estimates (e.g., $-0.047$) with less bias than a naive baseline ($-0.091$), highlighting its practical relevance for real-world platforms seeking causal insights while preserving user experience.
Abstract
Interdependencies between units in online two-sided marketplaces complicate the estimation of causal effects in experimental settings. We propose a novel experimental design to mitigate interference bias when estimating the total average treatment effect (TATE) of item-side interventions in such marketplaces. Our Two-Sided Prioritized Ranking (TSPR) design uses the recommender system as an instrument for experimentation: it strategically prioritizes items, based on their treatment status, in the listings displayed to users. TSPR is designed to give users a coherent platform experience by ensuring that all items remain accessible and that every user sees the same treatment realization for a given item. We evaluate the design through simulations using a search impression dataset from an online travel agency. Our methodology closely estimates the true simulated TATE, while a baseline item-side estimator significantly overestimates it.
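To make the idea of ranking by treatment status concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a uniform random partition of items into three groups and a hypothetical priority rule in which one query stratum moves Treated items to the top and the other moves Placebo items to the top. The actual priority rules and the $TATE$ estimator are not specified in this summary. All function names here are illustrative.

```python
import random

GROUPS = ("treated", "untreated", "placebo")

# Hypothetical priority rule (an assumption, not taken from the paper):
# stratum 0 promotes Treated items, stratum 1 promotes Placebo items.
PRIORITY = {0: "treated", 1: "placebo"}


def partition_items(item_ids, seed=0):
    """Assign each item to one group, uniformly at random (assumed scheme)."""
    rng = random.Random(seed)
    return {item: rng.choice(GROUPS) for item in item_ids}


def assign_stratum(query_id, seed=0):
    """Deterministically map a query to stratum 0 or 1."""
    return random.Random(f"{seed}-{query_id}").randrange(2)


def prioritized_ranking(ranked_items, groups, stratum):
    """Move the stratum's prioritized group to the top of the listing.

    sorted() is stable, so the recommender's original order is kept
    within the prioritized block and within the remaining items.
    No item is dropped, so every item stays accessible.
    """
    top = PRIORITY[stratum]
    return sorted(ranked_items, key=lambda item: groups[item] != top)


if __name__ == "__main__":
    groups = {"a": "untreated", "b": "treated", "c": "placebo", "d": "treated"}
    recommender_order = ["a", "b", "c", "d"]
    for stratum in (0, 1):
        print(stratum, prioritized_ranking(recommender_order, groups, stratum))
```

Because an item's group is fixed once by `partition_items` and only its position changes per query, each item keeps the same treatment status for every user, which mirrors the consistency property described above.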
