Post Launch Evaluation of Policies in a High-Dimensional Setting

Shima Nassiri, Mohsen Bayati, Joe Cooprider

TL;DR

This work addresses post-launch evaluation in ultra-high-dimensional settings where traditional A/B testing is costly or impractical. It proposes a two-phase synthetic-control–inspired framework: first filter the donor pool with nearest-neighbor matching on unit covariates, then apply high-dimensional vertical regression to predict counterfactual outcomes $\hat{Y}_{it}(0)$ and estimate the average treatment effect $\tau$ and heterogeneous effects $\tau_{it}$. Across six large experiments, the two-phase method improves donor-pool alignment and counterfactual accuracy, but machine-learning bias can distort effect sizes, motivating a debiasing strategy that combines relative-error minimization with a bias penalty (with $\alpha \approx 20$) to yield closer agreement with ground-truth ATE and stable performance over time. The approach enables scalable, post-launch policy evaluation with granular HTE estimates, offering practical impact for e-commerce and ride-sharing platforms where rapid, reliable decision-making at scale is essential.
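
As a reading aid, the relationship between the predicted counterfactual and the effect estimates can be written out. This is a sketch in standard synthetic-control notation, not necessarily the paper's exact definition: the index sets $\mathcal{I}$ (treated units) and $\mathcal{T}_{\text{post}}$ (post-launch periods) and the unweighted average are assumptions introduced here.

$$
\hat{\tau}_{it} = Y_{it} - \hat{Y}_{it}(0), \qquad
\hat{\tau} = \frac{1}{|\mathcal{I}|\,|\mathcal{T}_{\text{post}}|} \sum_{i \in \mathcal{I}} \sum_{t \in \mathcal{T}_{\text{post}}} \hat{\tau}_{it}
$$

Here $Y_{it}$ is the observed post-launch outcome of treated unit $i$ in period $t$, so any systematic error in $\hat{Y}_{it}(0)$ carries over one-for-one into $\hat{\tau}_{it}$.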

Abstract

A/B tests, also known as randomized controlled experiments (RCTs), are the gold standard for evaluating the impact of new policies, products, or decisions. However, these tests can be costly in terms of time and resources, potentially exposing users, customers, or other test subjects (units) to inferior options. This paper explores practical considerations in applying methodologies inspired by "synthetic control" as an alternative to traditional A/B testing in settings with very large numbers of units, involving up to hundreds of millions of units, which is common in modern applications such as e-commerce and ride-sharing platforms. This method is particularly valuable in settings where the treatment affects only a subset of units, leaving many units unaffected. In these scenarios, synthetic control methods leverage data from unaffected units to estimate counterfactual outcomes for treated units. After the treatment is implemented, these estimates can be compared to actual outcomes to measure the treatment effect. A key challenge in creating accurate counterfactual outcomes is interpolation bias, a well-documented phenomenon that occurs when control units differ significantly from treated units. To address this, we propose a two-phase approach: first using nearest neighbor matching based on unit covariates to select similar control units, then applying supervised learning methods suitable for high-dimensional data to estimate counterfactual outcomes. Testing using six large-scale experiments demonstrates that this approach successfully improves estimate accuracy. However, our analysis reveals that machine learning bias -- which arises from methods that trade off bias for variance reduction -- can impact results and affect conclusions about treatment effects. We document this bias in large-scale experimental settings and propose effective de-biasing techniques to address this challenge.
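
The two-phase approach described above can be illustrated with a minimal sketch on synthetic data. It is not the authors' implementation: the function names `select_donors` and `vertical_ridge`, the closed-form ridge regression (standing in for the high-dimensional supervised learners the paper evaluates), the Euclidean distance metric, and every numeric setting are assumptions chosen for illustration.

```python
import numpy as np

def select_donors(x_treated, X_controls, k):
    """Phase 1 (hypothetical helper): indices of the k control units nearest
    to the treated unit in covariate space, by Euclidean distance."""
    dists = np.linalg.norm(X_controls - x_treated, axis=1)
    return np.argsort(dists)[:k]

def vertical_ridge(y_pre, D_pre, D_post, lam=1.0):
    """Phase 2 (hypothetical helper): regress the treated unit's pre-launch
    outcomes on the donors' pre-launch outcomes (rows = periods, columns =
    donors), then apply the fitted weights to the donors' post-launch outcomes."""
    k = D_pre.shape[1]
    w = np.linalg.solve(D_pre.T @ D_pre + lam * np.eye(k), D_pre.T @ y_pre)
    return D_post @ w

# Synthetic data, for illustration only: outcomes follow a low-rank factor
# model driven by unit covariates, plus noise.
rng = np.random.default_rng(0)
n_controls, n_cov, t_pre, t_post = 200, 3, 30, 10
true_effect = 2.0  # assumed constant effect added after launch

X_controls = rng.normal(size=(n_controls, n_cov))
x_treated = np.array([0.5, -0.3, 0.8])
F = rng.normal(size=(t_pre + t_post, n_cov))  # time-varying factors

Y_controls = F @ X_controls.T + 0.1 * rng.normal(size=(t_pre + t_post, n_controls))
y_treated = F @ x_treated + 0.1 * rng.normal(size=t_pre + t_post)
y_treated[t_pre:] += true_effect

# Phase 1: restrict the donor pool to the 20 most similar controls.
donor_idx = select_donors(x_treated, X_controls, k=20)
D = Y_controls[:, donor_idx]

# Phase 2: predict the post-launch counterfactual for the treated unit.
y0_hat = vertical_ridge(y_treated[:t_pre], D[:t_pre], D[t_pre:], lam=1.0)

tau_hat = y_treated[t_pre:] - y0_hat  # per-period effect estimates
ate_hat = tau_hat.mean()              # average over post-launch periods
print(f"estimated effect: {ate_hat:.2f} (true: {true_effect})")
```

Because ridge shrinks the donor weights toward zero, the predicted counterfactual is slightly biased even in this clean setting. This is the kind of regularization-induced bias the abstract refers to, although the paper's de-biasing techniques are not reproduced here.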

Paper Structure (26 sections, 5 equations, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Comparison of prediction error (left) and bias distribution (right) across methods in Experiment C. PCRRidge-2, Ridge-2, and PCRLasso-2 implement debiasing through modified hyperparameter tuning, whereas RidgeDebiased-2 uses an alternative approach based on sample splitting. White dots in the right panel indicate distribution means.
  • Figure 2: Heterogeneous treatment effects illustrated through product-level time series. Each panel shows outcomes for a different product: treated values (blue line) and non-treated values (orange line) over time, with the vertical red dashed line indicating treatment start. The orange line after treatment represents predicted counterfactual outcomes, while the blue line shows actual outcomes. The distinct patterns across products—showing both positive and negative gaps between treated and counterfactual values—demonstrate how treatment effects can vary substantially across units. Note the different scales and volatility patterns across products, highlighting the challenge of accurate counterfactual prediction in highly noisy and heterogeneous settings.