$t$-Testing the Waters: Empirically Validating Assumptions for Reliable A/B-Testing

Olivier Jeunen

TL;DR

The paper addresses the reliability of $t$-tests for estimating the Average Treatment Effect ($ATE$) in online experiments when outcomes may be non-normal. It proposes an empirical validation workflow based on repeated A/A-tests and a Kolmogorov-Smirnov (KS) test to assess the uniformity of the $p$-value distribution and, with it, the coverage of confidence intervals. It provides a practical framework, derives standard error formulas for the difference-in-means estimator, and demonstrates the method on large-scale real-world data, revealing when CLT-based inferences may fail and how to diagnose such failures. It also analyzes how event frequency and distributional skewness relate to CLT convergence, showing that the KS diagnostic captures information that these simple covariates alone do not. The work aims to improve the reliability and robustness of A/B-testing practices by flagging problematic designs and suggesting nonparametric remedies when necessary.

Abstract

A/B-tests are a cornerstone of experimental design on the web, with wide-ranging applications and use-cases. The statistical $t$-test comparing differences in means is the most commonly used method for assessing treatment effects, often justified through the Central Limit Theorem (CLT). The CLT guarantees that, as the sample size grows, the sampling distribution of the Average Treatment Effect converges to normality, making the $t$-test valid for sufficiently large sample sizes. When outcome measures are skewed or non-normal, quantifying what "sufficiently large" entails is not straightforward. To ensure that confidence intervals maintain proper coverage and that $p$-values accurately reflect the false positive rate, it is critical to validate this normality assumption. We propose a practical method to test this, by analysing repeatedly resampled A/A-tests. When the normality assumption holds, the resulting $p$-value distribution should be uniform, and this property can be tested using the Kolmogorov-Smirnov test. This provides an efficient and effective way to empirically assess whether the $t$-test's assumptions are met, and whether the A/B-test is valid. We demonstrate our methodology and highlight how it helps to identify scenarios prone to inflated Type-I errors. Our approach provides a practical framework to ensure and improve the reliability and robustness of A/B-testing practices.
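The workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `aa_test_pvalues` and `ks_uniformity`, the 50/50 random split, the use of Welch's $t$-test, and the synthetic log-normal outcomes are all assumptions made for illustration. The idea is to repeatedly split one set of per-user outcomes into two arms with no treatment applied, collect the $t$-test $p$-value from each split, and compare those $p$-values against the Uniform(0, 1) distribution with a Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy import stats


def aa_test_pvalues(outcomes, n_resamples=1000, seed=0):
    """Return t-test p-values from repeatedly resampled A/A splits.

    Hypothetical helper: `outcomes` holds one value per user.
    """
    rng = np.random.default_rng(seed)
    outcomes = np.asarray(outcomes, dtype=float)
    pvalues = np.empty(n_resamples)
    for i in range(n_resamples):
        # Randomly assign each user to one of two arms. No treatment is
        # applied, so the true ATE is zero by construction.
        in_a = rng.random(outcomes.size) < 0.5
        arm_a, arm_b = outcomes[in_a], outcomes[~in_a]
        # Welch's t-test on the difference in means (an assumed choice).
        pvalues[i] = stats.ttest_ind(arm_a, arm_b, equal_var=False).pvalue
    return pvalues


def ks_uniformity(pvalues):
    """KS test of the p-values against Uniform(0, 1).

    Returns the D-statistic and the KS p-value.
    """
    result = stats.kstest(pvalues, "uniform")
    return result.statistic, result.pvalue


if __name__ == "__main__":
    # Synthetic, heavily skewed outcomes standing in for a real metric.
    rng = np.random.default_rng(42)
    skewed = rng.lognormal(mean=0.0, sigma=2.0, size=5000)
    d_stat, ks_p = ks_uniformity(aa_test_pvalues(skewed, n_resamples=500))
    print(f"D = {d_stat:.3f}, KS p-value = {ks_p:.3f}")
```

A small KS $p$-value suggests that the A/A $p$-values deviate from uniformity, which would indicate that the $t$-test's normality assumption is questionable for that metric at that sample size. Note that splits drawn from the same data are not independent of one another, so the KS $p$-value is best read as a diagnostic rather than an exact significance level.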

Paper Structure

This paper contains 9 sections, 6 equations, 3 figures.

Figures (3)

  • Figure 1: Visualising the Kolmogorov-Smirnov $D$-statistic and resulting $p$-value per user-event we measure. Whilst the majority of $p$-value distributions cannot be distinguished from uniform, we reject the null hypothesis for several.
  • Figure 2: Visualising the overall number of observations per event against the event's $D$-statistic, with a log-linear trendline. Whilst rare events lead to an increase in distribution divergence, the relationship is not monotonic (Spearman's $\rho\approx0.45$).
  • Figure 3: The empirical density function for various events intuitively shows that the sample skewness of the empirical event distribution per user is an indicator of the required sample size for the CLT to kick in, and the mean event distribution to approach normality (Spearman's $\rho\approx0.43$).