Choosing a Proxy Metric from Past Experiments

Nilesh Tripuraneni, Lee Richardson, Alexander D'Amour, Jacopo Soriano, Steve Yadlowsky

TL;DR

The paper tackles the challenge of inferring a long-term outcome in large-scale A/B tests by constructing optimal short-term proxy metrics. It defines proxy quality as a correlation-driven objective that balances a proxy's latent alignment with the long-term outcome against the experiment's noise level, and then reduces proxy selection to a portfolio-optimization problem over base proxies, with weights that adapt to the experiment's sample size. A hierarchical model denoises historical treatment effect (TE) data to estimate latent covariances, which feed the optimization and yield a composite proxy that adapts to the noise level of each new experiment. Evaluated on 307 real A/B tests from an industrial recommender system, the resulting composite proxy outperforms baselines in proxy score and proxy quality while accounting for heterogeneity in experimental noise. Overall, the framework provides a principled, data-driven method to replace or augment long-horizon metrics with adaptive surrogates for near-term product decisions.

Abstract

In many randomized experiments, the treatment effect of the long-term metric (i.e. the primary outcome of interest) is often difficult or infeasible to measure. Such long-term metrics are often slow to react to changes and sufficiently noisy that they are challenging to faithfully estimate in short-horizon experiments. A common alternative is to measure several short-term proxy metrics in the hope that they closely track the long-term metric, so that they can be used to effectively guide decision-making in the near term. We introduce a new statistical framework to both define and construct an optimal proxy metric for use in a homogeneous population of randomized experiments. Our procedure first reduces the construction of an optimal proxy metric in a given experiment to a portfolio optimization problem which depends on the true latent treatment effects and the noise level of the experiment under consideration. We then denoise the observed treatment effects of the long-term metric and a set of proxies in a historical corpus of randomized experiments to extract estimates of the latent treatment effects for use in the optimization problem. One key insight derived from our approach is that the optimal proxy metric for a given experiment is not fixed a priori; rather, it should depend on the sample size (or effective noise level) of the randomized experiment for which it is deployed. To instantiate and evaluate our framework, we employ our methodology in a large corpus of randomized experiments from an industrial recommendation system and construct proxy metrics that perform favorably relative to several baselines.
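
The claim that the optimal proxy depends on sample size can be made concrete with a small numerical sketch. It assumes, as one reading of the "correlation-driven objective" described above, that proxy quality is the correlation between the latent long-term TE and the noisy composite proxy TE, and it uses the unconstrained maximizer of that correlation; the paper's actual optimization may impose additional constraints. All names (`proxy_quality`, `optimal_weights`, `unit_noise`) and all numbers are illustrative assumptions, not values from the paper.

```python
import numpy as np

def proxy_quality(w, lam_pp, lam_pn, lam_nn, xi_pp):
    """Correlation between the latent long-term TE and the noisy composite
    proxy TE w^T (latent proxy TEs + experimental noise). Assumed objective."""
    signal = w @ lam_pn
    total_var = lam_nn * (w @ (lam_pp + xi_pp) @ w)
    return signal / np.sqrt(total_var)

def optimal_weights(lam_pp, lam_pn, xi_pp):
    """Unconstrained maximizer of the correlation above, rescaled to sum to 1.
    The objective is invariant to positive rescaling of w, so the rescaling
    does not change the quality (it assumes the raw weights have a positive sum)."""
    w = np.linalg.solve(lam_pp + xi_pp, lam_pn)
    return w / w.sum()

# Hypothetical latent covariances for two base proxies.
lam_pp = np.array([[1.0, 0.3], [0.3, 1.0]])  # latent covariance among proxies
lam_pn = np.array([0.8, 0.4])                # latent covariance with the long-term TE
lam_nn = 1.0                                 # latent variance of the long-term TE

# Hypothetical per-unit noise: proxy 0 is well aligned but noisy,
# proxy 1 is less aligned but precise. Noise covariance shrinks as 1/n.
unit_noise = np.diag([50.0, 1.0])

for n in (10, 10_000):
    xi_pp = unit_noise / n
    w = optimal_weights(lam_pp, lam_pn, xi_pp)
    q = proxy_quality(w, lam_pp, lam_pn, lam_nn, xi_pp)
    print(f"n={n}: weights={np.round(w, 3)}, quality={q:.3f}")

w_small = optimal_weights(lam_pp, lam_pn, unit_noise / 10)
w_large = optimal_weights(lam_pp, lam_pn, unit_noise / 10_000)
```

In this toy setting the weight on the noisy but well-aligned proxy is larger at the larger sample size, which mirrors the qualitative behaviour the abstract describes.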

Paper Structure (16 sections, 15 equations, 5 figures, 1 table, 1 algorithm)

Figures (5)

  • Figure 1: In a new experiment, we view the observed treatment effects (TEs) as being generated from their corresponding (unobserved) latent values by a noisy channel that adds independent, mean-zero experimental noise with covariance $\bm{\Xi}$. In this new experiment, the noisy observed TE on the long-term outcome, $\hat{\Delta}^N$, is inaccessible. We seek noisy proxy metrics whose TEs closely track the population TEs on the long-term outcome.
  • Figure 2: The panel visualizes the denoising effect of fitting a hierarchical model to raw TEs to uncover their latent variation on synthetic data. We generate 1500 synthetic datapoints sampled from the model in \ref{eq:level1} with one proxy metric. Each datapoint represents a synthetic TE measurement from a single A/B test. We use parameters with $\begin{pmatrix} \mu^N \\ \mu^P \end{pmatrix} = \begin{pmatrix} 0.0 \\ 0.0 \end{pmatrix}$, $\bm{\Lambda} = 0.01 \cdot \begin{pmatrix} 1 & 0.2 \\ 0.2 & 1 \end{pmatrix}$ to generate data in panel (a). We add Gaussian noise with covariance $\bm{\Xi} = 0.02 \cdot \begin{pmatrix} 1 & 0.7 \\ 0.7 & 1 \end{pmatrix}$ to them in panel (b). Finally, we fit the generative model to the observed data in \ref{eq:hm} using the within-experiment covariances $\bm{\Xi}$ in panel (c). Panel (c) illustrates how the hierarchical model denoises the raw observed TEs to disentangle the latent variation in the population from the experimental noise in each synthetic A/B test.
  • Figure 3: The dependence of the optimal weighting on sample size for our new composite proxy represents a bias-variance trade-off. For large sample sizes, the weighting favors potentially noisier metrics that are more aligned with the long-term outcome. For smaller sample sizes, however, the optimal weighting backs off to metrics that are less noisy but also less aligned with the long-term outcome.
  • Figure 4: Both displays show the within-experiment marginal sample variance (blue dots) for two different metrics computed across 307 different A/B tests and their corresponding power-law fit (red line). Despite the underlying A/B tests being different, we found that the variances were reasonably well modeled by a single inverse power law with the same constant prefactor over the entire population.
  • Figure 5: A synthetic contingency table which captures the alignment of the decisions induced by the t-statistics of the TEs of the north star metric and a proxy metric.
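
The synthetic setup in the Figure 2 caption can be sketched in a few lines. The parameter values below are taken from that caption (zero means, latent covariance 0.01 · [[1, 0.2], [0.2, 1]], noise covariance 0.02 · [[1, 0.7], [0.7, 1]], 1500 datapoints). The denoising step is a simplification and is an assumption of this sketch: instead of fitting the paper's hierarchical model, it subtracts the known noise covariance from the sample covariance (a moment-based estimate) and computes the Gaussian posterior mean of the latent TEs using the true parameters. Variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tests = 1500  # one synthetic TE pair (long-term metric, proxy) per A/B test

mu = np.array([0.0, 0.0])                            # (mu^N, mu^P)
Lambda = 0.01 * np.array([[1.0, 0.2], [0.2, 1.0]])   # latent TE covariance
Xi = 0.02 * np.array([[1.0, 0.7], [0.7, 1.0]])       # within-experiment noise covariance

# Panel (a): latent TEs drawn from the population distribution.
latent = rng.multivariate_normal(mu, Lambda, size=n_tests)

# Panel (b): observed TEs = latent TEs + independent mean-zero noise.
noise = rng.multivariate_normal(np.zeros(2), Xi, size=n_tests)
observed = latent + noise

# Moment-based stand-in for the hierarchical fit: Cov(observed) = Lambda + Xi,
# so subtracting the known Xi gives an estimate of Lambda.
Lambda_hat = np.cov(observed, rowvar=False) - Xi

# Panel (c), simplified: Gaussian posterior mean of the latent TEs given the
# observations, computed here with the true mu, Lambda, Xi for clarity.
gain = Lambda @ np.linalg.inv(Lambda + Xi)
denoised = mu + (observed - mu) @ gain.T

print("estimated latent covariance:\n", Lambda_hat)
print("mean squared error, observed vs latent:", np.mean((observed - latent) ** 2))
print("mean squared error, denoised vs latent:", np.mean((denoised - latent) ** 2))
```

With these parameters the observed covariance Λ + Ξ has off-diagonal 0.016 and diagonal 0.03, a correlation of about 0.53, even though the latent correlation is only 0.2. This is why raw TEs can overstate how well a proxy tracks the long-term metric.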