Effect Size Estimation for Duration Recommendation in Online Experiments: Leveraging Hierarchical Models and Objective Utility Approaches

Yu Liu; Runzhe Wan; James McQueen; Doug Hains; Jinxiang Gu; Rui Song

TL;DR

The paper addresses how to automatically choose the assumed effect size (AES) that determines the duration of online A/B tests. It presents two complementary approaches: (i) a three-layer heteroscedastic Gaussian Mixture Model (GMM) that accounts for experiment-level heterogeneity and isolates the positive-effect component to estimate the AES, and (ii) a Bayesian utility framework that chooses the AES by balancing the cost of running experiments against the accuracy of decision-making. The authors provide an EM-based estimation procedure for the GMM, add a variance-penalty extension to prevent degeneracy, and use grid search to optimize the utility-based AES. Through simulations and a large-scale meta-analysis of 3,300 real Amazon experiments, they show that the three-layer GMM yields more accurate AES estimates and that the utility-based approach achieves higher expected cumulative rewards, supporting both methods for scalable duration recommendation in large online experimentation services. The work offers practical, data-driven tools for automated duration planning and points to personalization and Bayesian nonparametrics as future directions for tailoring AES recommendations further.

Abstract

The selection of the assumed effect size (AES) critically determines the duration of an experiment, and hence its accuracy and efficiency. Traditionally, experimenters determine the AES based on domain knowledge. However, this method becomes impractical for online experimentation services managing numerous experiments, and a more automated approach is hence in great demand. We initiate the study of data-driven AES selection for online experimentation services by introducing two solutions. The first employs a three-layer Gaussian Mixture Model that accounts for heteroskedasticity across experiments, and it seeks to estimate the true expected effect size among positive experiments. The second method, grounded in utility theory, aims to determine the optimal effect size by striking a balance between the experiment's cost and the precision of decision-making. Through comparisons with baseline methods using both simulated and real data, we showcase the superior performance of the proposed approaches.
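To make the link between the AES and experiment duration concrete, here is a minimal sketch based on the textbook sample-size formula for a two-sided, two-sample z-test with equal allocation. It is not taken from the paper: the function names, the default significance level and power, and the `daily_users_per_arm` parameter are illustrative assumptions.

```python
import math
from statistics import NormalDist


def required_sample_size_per_arm(aes, sigma, alpha=0.05, power=0.8):
    """Per-arm sample size for a two-sided two-sample z-test.

    Textbook formula: n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / aes^2.
    `aes` is the assumed effect size (absolute difference in means) and
    `sigma` is the per-unit standard deviation, assumed equal in both arms.
    """
    if aes <= 0 or sigma <= 0:
        raise ValueError("aes and sigma must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / aes ** 2)


def recommended_duration_days(aes, sigma, daily_users_per_arm,
                              alpha=0.05, power=0.8):
    """Days needed to reach the required sample size (hypothetical helper).

    Assumes a constant number of new users per arm per day.
    """
    n = required_sample_size_per_arm(aes, sigma, alpha, power)
    return math.ceil(n / daily_users_per_arm)


if __name__ == "__main__":
    for aes in (0.2, 0.1, 0.05):
        days = recommended_duration_days(aes, sigma=1.0, daily_users_per_arm=500)
        print(f"AES={aes}: {days} day(s)")
```

Because the required sample size scales with 1/AES², halving the AES roughly quadruples the recommended duration under this formula, which is why an overly optimistic or overly conservative AES translates directly into underpowered or needlessly long experiments.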

Paper Structure (10 sections, 18 equations, 3 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Preliminaries and Notations
Pooled Effect Size
Three-Layer Heteroscedastic GMM
Utility Theory: Bayesian Optimal Effect Size
Experiments
Accuracy Comparison with Simulation
Meta-Analysis with Real Experiments
Conclusion

Figures (3)

Figure 1: Illustration of the motivation for using a GMM. The x-axis is the effect size and the y-axis is the frequency. Data in (a) are from real experiments; the x-axis in this panel is not annotated owing to business confidentiality. (b) was simulated from a two-layer Gaussian Mixture Model (GMM) with mean values of $(-1, 0, 1)$, variances of $(0.2^2, 0.2^2, 0.2^2)$, and component weights of $(0.2, 0.6, 0.2)$. (c) was simulated from the same model as (b) but with variances of $(0.7^2, 0.3^2, 0.7^2)$.
Figure 2: Histogram of simulated observed effect size $d_i$.
Figure 3: Boxplots comparing the AES estimates from the pooled effect size, the standard two-layer GMM, and the proposed three-layer heteroscedastic GMM. The ground truth is 2.
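The synthetic panels (b) and (c) of Figure 1 can be approximated with a short simulation. The sketch below draws from a three-component Gaussian mixture using the means, variances, and weights quoted in the Figure 1 caption. The sample size, the seed, and the function name are assumptions, since the caption does not say how many points were drawn, and the code treats the quoted variances as the within-component variances of the observed effect size.

```python
import random
import statistics

# Parameters quoted in the Figure 1 caption.
MEANS = (-1.0, 0.0, 1.0)
WEIGHTS = (0.2, 0.6, 0.2)
SDS_PANEL_B = (0.2, 0.2, 0.2)  # variances (0.2^2, 0.2^2, 0.2^2)
SDS_PANEL_C = (0.7, 0.3, 0.7)  # variances (0.7^2, 0.3^2, 0.7^2)


def simulate_mixture(means, sds, weights, n=10_000, seed=0):
    """Draw n observed effect sizes from a Gaussian mixture.

    For each draw, pick a component k with probability weights[k],
    then sample from Normal(means[k], sds[k]^2).
    n and seed are illustrative choices, not values from the paper.
    """
    rng = random.Random(seed)
    components = rng.choices(range(len(means)), weights=weights, k=n)
    return [rng.gauss(means[k], sds[k]) for k in components]


if __name__ == "__main__":
    panel_b = simulate_mixture(MEANS, SDS_PANEL_B, WEIGHTS)
    panel_c = simulate_mixture(MEANS, SDS_PANEL_C, WEIGHTS)
    print("panel (b) variance:", round(statistics.pvariance(panel_b), 3))
    print("panel (c) variance:", round(statistics.pvariance(panel_c), 3))
```

With the small variances of panel (b), a histogram of the draws shows three well-separated modes; with the larger variances of panel (c), the components overlap and the modes blur together, which is the situation in which separating positive effects from null effects becomes harder.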
