Post Launch Evaluation of Policies in a High-Dimensional Setting
Shima Nassiri, Mohsen Bayati, Joe Cooprider
TL;DR
This work addresses post-launch evaluation in ultra-high-dimensional settings where traditional A/B testing is costly or impractical. It proposes a two-phase synthetic-control–inspired framework: first filter the donor pool with nearest-neighbor matching on unit covariates, then apply high-dimensional vertical regression to predict counterfactual outcomes $\hat{Y}_{it}(0)$ and estimate the average treatment effect $\tau$ and heterogeneous effects $\tau_{it}$. Across six large experiments, the two-phase method improves donor-pool alignment and counterfactual accuracy, but machine-learning bias can distort effect sizes, motivating a debiasing strategy that combines relative-error minimization with a bias penalty (with $\alpha \approx 20$) to yield closer agreement with ground-truth ATE and stable performance over time. The approach enables scalable, post-launch policy evaluation with granular HTE estimates, offering practical impact for e-commerce and ride-sharing platforms where rapid, reliable decision-making at scale is essential.
Abstract
A/B tests, also known as randomized controlled trials (RCTs), are the gold standard for evaluating the impact of new policies, products, or decisions. However, these tests can be costly in terms of time and resources, and they can expose users, customers, or other test subjects (units) to inferior options. This paper explores practical considerations in applying methodologies inspired by "synthetic control" as an alternative to traditional A/B testing in settings with very large numbers of units, up to hundreds of millions, as is common in modern applications such as e-commerce and ride-sharing platforms. These methods are particularly valuable when the treatment affects only a subset of units, leaving many units unaffected. In these scenarios, synthetic control methods leverage data from unaffected units to estimate counterfactual outcomes for treated units. After the treatment is implemented, these estimates can be compared to actual outcomes to measure the treatment effect. A key challenge in creating accurate counterfactual outcomes is interpolation bias, a well-documented phenomenon that occurs when control units differ significantly from treated units. To address this, we propose a two-phase approach: first using nearest neighbor matching based on unit covariates to select similar control units, then applying supervised learning methods suitable for high-dimensional data to estimate counterfactual outcomes. Tests on six large-scale experiments show that this approach improves the accuracy of the estimates. However, our analysis reveals that machine learning bias, which arises from methods that trade off bias for variance reduction, can impact results and affect conclusions about treatment effects. We document this bias in large-scale experimental settings and propose effective de-biasing techniques to address this challenge.
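As a rough illustration of the two-phase idea described in the abstract, the following Python sketch filters donors by nearest-neighbor matching on covariates and then fits a vertical regression of the treated unit's pre-treatment outcomes on the selected donors' outcomes. It is a minimal sketch under stated assumptions, not the authors' implementation: ridge regression stands in for the unspecified high-dimensional supervised learner, the data are simulated from a hypothetical low-rank factor model, and all names (`nearest_donors`, `vertical_ridge`, `estimate_effect`, `k`, `lam`) are illustrative. The de-biasing step is not shown.

```python
import numpy as np


def nearest_donors(x_treated, X_donors, k):
    """Phase 1: indices of the k donors closest to the treated unit (Euclidean distance on covariates)."""
    dists = np.linalg.norm(X_donors - x_treated, axis=1)
    return np.argsort(dists)[:k]


def vertical_ridge(Y_donors_pre, y_treated_pre, lam=1.0):
    """Phase 2: ridge weights w so that Y_donors_pre.T @ w approximates y_treated_pre.

    Y_donors_pre has shape (n_donors, T_pre); y_treated_pre has shape (T_pre,).
    Ridge is an assumed stand-in for the paper's high-dimensional learner.
    """
    A = Y_donors_pre.T  # rows are time periods, columns are donors
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y_treated_pre)


def estimate_effect(x_treated, y_treated, X_donors, Y_donors, t0, k=20, lam=1.0):
    """Per-period effects and their average for one treated unit; treatment starts at period t0."""
    idx = nearest_donors(x_treated, X_donors, k)
    w = vertical_ridge(Y_donors[idx, :t0], y_treated[:t0], lam)
    y0_hat = Y_donors[idx, t0:].T @ w  # predicted counterfactual outcomes after t0
    tau_t = y_treated[t0:] - y0_hat    # observed minus predicted counterfactual
    return tau_t, tau_t.mean()


# Simulated example (hypothetical data, assumed low-rank factor structure).
rng = np.random.default_rng(0)
n_donors, n_cov, T, t0 = 200, 3, 60, 40
true_effect = 2.0

F = rng.normal(size=(n_cov, T))                 # time-varying factors
X_donors = rng.normal(size=(n_donors, n_cov))   # donor covariates
Y_donors = X_donors @ F + 0.05 * rng.normal(size=(n_donors, T))
x_treated = rng.normal(size=n_cov)
y_treated = x_treated @ F + 0.05 * rng.normal(size=T)
y_treated[t0:] += true_effect                   # treatment shifts post-period outcomes

tau_t, ate_hat = estimate_effect(x_treated, y_treated, X_donors, Y_donors, t0)
print(f"estimated average effect: {ate_hat:.3f} (simulated truth: {true_effect})")
```

In this sketch the matching step shrinks the donor pool before any regression is fit, which is what keeps the second phase tractable when the number of candidate control units is very large. The regularization strength `lam` illustrates the bias-variance trade-off mentioned in the abstract: larger values shrink the fitted counterfactual and can bias the estimated effect.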
