Choosing a Proxy Metric from Past Experiments
Nilesh Tripuraneni, Lee Richardson, Alexander D'Amour, Jacopo Soriano, Steve Yadlowsky
TL;DR
The paper tackles the challenge of inferring a long-term outcome in large-scale A/B tests by constructing optimal short-term proxy metrics. It defines proxy quality as a correlation-based objective that balances a proxy's latent alignment with the long-term outcome against the experiment's noise level, and then reduces proxy selection to a portfolio-optimization problem over base proxies, with weights that adapt to the experiment's sample size. A hierarchical model denoises historical treatment-effect (TE) estimates to recover latent covariances, which feed the optimization and yield a composite proxy that adapts to noise and improves decisions in new experiments. Evaluated on 307 real A/B tests from an industrial recommender system, the resulting composite proxy outperforms baselines in proxy score and proxy quality, improving near-term decision-making while accounting for heterogeneity in experimental noise. Overall, the framework provides a principled, data-driven method to replace or augment long-horizon metrics with adaptive surrogates that better guide product decisions.
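The denoising step can be illustrated with a much simpler stand-in for the hierarchical model. The sketch below is an assumption, not the paper's estimator: it uses a method-of-moments identity, Cov(observed TE) = Cov(latent TE) + average sampling-noise covariance, and assumes that each experiment's noise covariance is known and that noise is independent of the latent effects. The function name `estimate_latent_cov` and all numbers are hypothetical, and the data are simulated.

```python
import numpy as np

def estimate_latent_cov(observed_te, noise_covs):
    """Moment-based estimate of the latent treatment-effect covariance.

    observed_te: (K, d) array of estimated effects, K experiments by d metrics.
    noise_covs:  (K, d, d) array of per-experiment sampling-noise covariances
                 (assumed known here; this is an illustrative assumption).
    """
    observed_te = np.asarray(observed_te, dtype=float)
    noise_covs = np.asarray(noise_covs, dtype=float)
    # Covariance of the noisy estimates across experiments.
    total_cov = np.cov(observed_te, rowvar=False)
    # Subtract the average noise covariance to isolate the latent part.
    latent_cov = total_cov - noise_covs.mean(axis=0)
    # Symmetrize and clip negative eigenvalues so the result is a valid covariance.
    latent_cov = (latent_cov + latent_cov.T) / 2
    eigvals, eigvecs = np.linalg.eigh(latent_cov)
    return (eigvecs * np.clip(eigvals, 0.0, None)) @ eigvecs.T

# Simulated corpus: two metrics, identical noise covariance in every experiment.
rng = np.random.default_rng(0)
num_experiments = 20000
true_latent_cov = np.array([[1.0, 0.6], [0.6, 1.0]])
noise_cov = np.diag([0.5, 2.0])
latent = rng.multivariate_normal(np.zeros(2), true_latent_cov, size=num_experiments)
noise = rng.multivariate_normal(np.zeros(2), noise_cov, size=num_experiments)
observed = latent + noise

estimated = estimate_latent_cov(
    observed, np.broadcast_to(noise_cov, (num_experiments, 2, 2))
)
print(np.round(estimated, 2))
```

A hierarchical model of the kind the summary describes would typically pool information across experiments more carefully than this subtraction does, especially when noise levels differ widely between experiments.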
Abstract
In many randomized experiments, the treatment effect of the long-term metric (i.e., the primary outcome of interest) is often difficult or infeasible to measure. Such long-term metrics are often slow to react to changes and sufficiently noisy that they are challenging to faithfully estimate in short-horizon experiments. A common alternative is to measure several short-term proxy metrics in the hope that they closely track the long-term metric, so that they can be used to effectively guide decision-making in the near term. We introduce a new statistical framework to both define and construct an optimal proxy metric for use in a homogeneous population of randomized experiments. Our procedure first reduces the construction of an optimal proxy metric in a given experiment to a portfolio optimization problem which depends on the true latent treatment effects and the noise level of the experiment under consideration. We then denoise the observed treatment effects of the long-term metric and a set of proxies in a historical corpus of randomized experiments to extract estimates of the latent treatment effects for use in the optimization problem. One key insight derived from our approach is that the optimal proxy metric for a given experiment is not a priori fixed; rather, it should depend on the sample size (or effective noise level) of the randomized experiment for which it is deployed. To instantiate and evaluate our framework, we employ our methodology in a large corpus of randomized experiments from an industrial recommendation system and construct proxy metrics that perform favorably relative to several baselines.
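To make the dependence on sample size concrete, here is a minimal sketch under simplifying assumptions that are not taken from the paper: proxy quality is taken to be the correlation between the noisy composite proxy estimate and the latent long-term effect, sampling noise scales as 1/n, and the weights are unconstrained, so the maximizer has a closed form proportional to (latent_cov + noise_cov / n)^{-1} cross_cov. The paper's exact objective and constraints may differ. The names `proxy_quality` and `optimal_weights` and all numbers are hypothetical.

```python
import numpy as np

def proxy_quality(w, latent_cov, cross_cov, noise_cov, n, var_long=1.0):
    """Correlation between the noisy composite proxy and the latent long-term effect.

    latent_cov: (d, d) covariance of the proxies' latent treatment effects.
    cross_cov:  (d,) covariance of each proxy's latent effect with the long-term effect.
    noise_cov:  (d, d) per-unit sampling-noise covariance; noise shrinks as 1/n.
    """
    total_cov = latent_cov + noise_cov / n
    return float(w @ cross_cov / np.sqrt((w @ total_cov @ w) * var_long))

def optimal_weights(latent_cov, cross_cov, noise_cov, n):
    """Unconstrained maximizer of proxy_quality, rescaled to unit L1 norm."""
    total_cov = latent_cov + noise_cov / n
    w = np.linalg.solve(total_cov, cross_cov)
    # The objective is invariant to positive rescaling, so normalize for readability.
    return w / np.abs(w).sum()

# Proxy A tracks the long-term metric closely but is noisy;
# proxy B is less aligned but measured far more precisely.
latent_cov = np.eye(2)
cross_cov = np.array([0.8, 0.4])
noise_cov = np.diag([400.0, 4.0])

w_small = optimal_weights(latent_cov, cross_cov, noise_cov, n=10)
w_large = optimal_weights(latent_cov, cross_cov, noise_cov, n=100_000)
print("small experiment weights:", np.round(w_small, 3))
print("large experiment weights:", np.round(w_large, 3))
```

In this toy setup the small experiment puts most of its weight on the precise proxy B, while the large experiment shifts weight toward the better-aligned proxy A, which is the qualitative behavior the abstract describes.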
