Powerful A/B-Testing Metrics and Where to Find Them

Olivier Jeunen, Shubham Baweja, Neeti Pokharna, Aleksei Ustimenko

TL;DR

The paper studies how to evaluate online metrics for A/B tests beyond a single North Star, quantifying decision quality through type-I, type-II, and type-III error rates ($\varepsilon_{\rm I}$, $\varepsilon_{\rm II}$, and $\varepsilon_{\rm III}$) and the distribution of statistical power across historical experiments. It builds a scalable pipeline that collects past experiments, labels them as Known outcomes $\mathcal{E}^{+}$, Unknown outcomes $\mathcal{E}^{?}$, or A/A outcomes $\mathcal{E}^{\simeq}$, and computes metric-level statistics such as $z_m^{A \succ B}$ to assess each metric's utility. Empirically, on ShareChat (and Moj) data, selecting appropriate proxy metrics and applying multiple-testing corrections raises power or reduces sampling needs: a proxy set including DAU, Engagers, and TimeSpent yields a relative reduction of about $35\%$ in type-II errors, or a $3.5\times$ sample-size saving, with no type-III errors observed. The work offers a practical, scalable framework for validating online-evaluation metrics, enabling faster and more confident experimentation.

Abstract

Online controlled experiments, colloquially known as A/B-tests, are the bread and butter of real-world recommender system evaluation. Typically, end-users are randomly assigned some system variant, and a plethora of metrics are then tracked, collected, and aggregated throughout the experiment. A North Star metric (e.g. long-term growth or revenue) is used to assess which system variant should be deemed superior. As a result, most collected metrics are supporting in nature, and serve to either (i) provide an understanding of how the experiment impacts user experience, or (ii) allow for confident decision-making when the North Star metric moves insignificantly (i.e. a false negative or type-II error). The latter is not straightforward: suppose a treatment variant leads to fewer but longer sessions, with more views but fewer engagements; should this be considered a positive or negative outcome? The question then becomes: how do we assess a supporting metric's utility when it comes to decision-making using A/B-testing? Online platforms typically run dozens of experiments at any given time. This provides a wealth of information about interventions and treatment effects that can be used to evaluate metrics' utility for online evaluation. We propose to collect this information and leverage it to quantify type-I, type-II, and type-III errors for the metrics of interest, alongside a distribution of measurements of their statistical power (e.g. $z$-scores and $p$-values). We present results and insights from building this pipeline at scale for two large-scale short-video platforms: ShareChat and Moj; leveraging hundreds of past experiments to find online metrics with high statistical power.
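As a rough illustration of the proposal, the sketch below estimates the three error rates for a single metric from historical $z$-scores. It is not the authors' pipeline: the function name `error_rates`, the input layout, and the specific definitions (type-I as a significant result in an A/A test; type-II as a non-significant result when the winner is known; type-III as a significant result in the wrong direction) are assumptions made for this example, and the $z$-scores are made up.

```python
from statistics import NormalDist


def error_rates(known_z, aa_z, alpha=0.05):
    """Estimate error rates for one metric from past experiments.

    known_z: z-scores from experiments with a known winner, oriented so
             that a positive value agrees with the known outcome.
    aa_z:    z-scores from A/A experiments (no true difference).
    Both layouts are hypothetical conventions for this sketch.
    """
    # Two-sided critical value, about 1.96 for alpha = 0.05.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    # Type-I: the metric flags a difference although none exists.
    type_1 = sum(abs(z) > z_crit for z in aa_z) / len(aa_z)
    # Type-II: the metric fails to reach significance on a known outcome.
    type_2 = sum(abs(z) <= z_crit for z in known_z) / len(known_z)
    # Type-III: the metric is significant, but in the wrong direction.
    type_3 = sum(z < -z_crit for z in known_z) / len(known_z)

    return {"type_I": type_1, "type_II": type_2, "type_III": type_3}


# Made-up z-scores, for illustration only.
known = [2.5, 0.4, -2.2, 3.1]
aa = [0.3, -1.1, 2.4, 0.8, -0.2]
rates = error_rates(known, aa)
print(rates)
```

Repeating this per metric (or per set of metrics, with a multiple-testing correction applied to `alpha`) would give a table from which candidate proxy metrics can be compared.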

Paper Structure (4 sections, 4 equations, 1 figure)


Figures (1)

  • Figure 1: Empirical evaluation of various online evaluation metrics on ShareChat data: at a 95% confidence level, we can reduce type-II errors by a relative $35\%$ or reduce the necessary sample size by a factor of $3.5\times$ when considering the right set of metrics.
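To give a feel for how a sample-size factor relates to $z$-scores, the sketch below uses the standard approximation that, for a fixed effect size and variance, a $z$-score grows with the square root of the sample size. This is a general statistical relationship, not necessarily how the figure's number was obtained; the function names are hypothetical.

```python
import math


def sample_size_factor(z_ratio):
    """Sample-size saving implied by a z-score that is z_ratio times larger,
    assuming z scales with sqrt(n) at a fixed effect size and variance."""
    return z_ratio ** 2


def z_ratio_for_factor(factor):
    """Inverse: how much larger the z-score must be for a given saving."""
    return math.sqrt(factor)


# A 3.5x sample-size saving corresponds to z-scores roughly 1.87x larger.
print(round(z_ratio_for_factor(3.5), 2))
```

Under this approximation, a metric set whose $z$-scores are about $1.87$ times those of the baseline would reach the same significance level with roughly $3.5$ times fewer users.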