Effect Size Estimation for Duration Recommendation in Online Experiments: Leveraging Hierarchical Models and Objective Utility Approaches
Yu Liu, Runzhe Wan, James McQueen, Doug Hains, Jinxiang Gu, Rui Song
TL;DR
The paper addresses how to automatically choose the assumed effect size (AES) that determines the duration of online A/B tests. It presents two complementary approaches: (i) a three-layer heteroskedastic Gaussian Mixture Model (GMM) that accounts for experiment-level heterogeneity and identifies positive effects to estimate the AES, and (ii) a Bayesian utility framework that optimizes the AES by balancing the cost of running experiments against the accuracy of decision-making. The authors provide an EM-based estimation procedure for the GMM, include a variance-penalty extension to prevent degeneracy, and use grid search to optimize the utility-based AES. Through simulations and a large-scale meta-analysis of 3,300 real Amazon experiments, they show that the three-layer GMM yields more accurate AES estimates and that the utility-based approach achieves higher expected cumulative rewards, supporting both methods for scalable duration recommendation in large online experimentation services. The work offers practical, data-driven tools for automated duration planning and points to personalization and Bayesian nonparametrics as future directions for further tailoring AES recommendations.
Abstract
The selection of the assumed effect size (AES) critically determines the duration of an experiment, and hence its accuracy and efficiency. Traditionally, experimenters determine the AES based on domain knowledge. However, this approach becomes impractical for online experimentation services that manage numerous experiments, so a more automated approach is in great demand. We initiate the study of data-driven AES selection for online experimentation services by introducing two solutions. The first employs a three-layer Gaussian Mixture Model that accounts for heteroskedasticity across experiments and seeks to estimate the true expected effect size among positive experiments. The second, grounded in utility theory, aims to determine the optimal effect size by striking a balance between the experiment's cost and the precision of decision-making. Through comparisons with baseline methods on both simulated and real data, we demonstrate the superior performance of the proposed approaches.
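
To make the link between the AES and experiment duration concrete, the following is a minimal sketch based on the textbook sample-size formula for a two-sided, two-sample z-test with equal arms. It is not the authors' procedure. The function names, the known per-unit standard deviation `sigma`, the constant `daily_units_per_arm` traffic, and the default `alpha` and `power` values are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def required_sample_size_per_arm(aes, sigma, alpha=0.05, power=0.8):
    """Per-arm sample size to detect a mean difference of `aes`.

    Assumes a two-sided, two-sample z-test with equal arm sizes and a
    known, common per-unit standard deviation `sigma` (illustrative only).
    """
    if aes <= 0 or sigma <= 0:
        raise ValueError("aes and sigma must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = std_normal.inv_cdf(power)           # quantile for the target power
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / aes ** 2
    return math.ceil(n)


def recommended_duration_days(aes, sigma, daily_units_per_arm,
                              alpha=0.05, power=0.8):
    """Days needed to reach the required sample size.

    Assumes constant daily traffic per arm (a hypothetical simplification).
    """
    if daily_units_per_arm <= 0:
        raise ValueError("daily_units_per_arm must be positive")
    n = required_sample_size_per_arm(aes, sigma, alpha, power)
    return math.ceil(n / daily_units_per_arm)


if __name__ == "__main__":
    # Compare the recommended duration for a larger and a smaller AES.
    for aes in (0.10, 0.05):
        days = recommended_duration_days(aes, sigma=1.0, daily_units_per_arm=100)
        print(f"AES={aes:.2f}: {days} days")
```

Because the required sample size scales with the inverse square of the AES, halving the AES roughly quadruples the duration. An AES that is too large therefore produces short, underpowered experiments, while one that is too small produces long, costly ones.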
