AExGym: Benchmarks and Environments for Adaptive Experimentation
Jimmy Wang, Ethan Che, Daniel R. Jiang, Hongseok Namkoong
TL;DR
This paper introduces AExGym, an open-source benchmark and environment suite for adaptive experimentation in A/B testing settings. It emphasizes practical challenges such as non-stationarity, batched feedback, multiple objectives, constraints, and external validity, and provides real-world datasets to benchmark adaptive policies beyond idealized theory. The framework models adaptive experiments as Markov decision processes (MDPs) with an Environment, an Agent, and flexible evaluation criteria, enabling both in-experiment and post-experiment assessments, including best-arm identification and personalization. Empirical results across Meager, NHIS, ASOS, and field datasets reveal that static baselines can outperform adaptive methods under operational constraints, underscoring the need for robust, constraint-aware policies. The work aims to drive inductive, data-driven development of adaptive strategies that perform well in real-world deployment.
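To make the Environment/Agent/evaluation split concrete, the following is a minimal sketch of a batched adaptive experiment loop. It is not the AExGym API: every name here (`ToyEnvironment`, `UniformAgent`, `GreedyAgent`, `run_experiment`) is hypothetical, the arm means are synthetic and stationary, and epsilon-greedy merely stands in for an adaptive policy. The in-experiment metric is cumulative regret; the post-experiment metric is the simple regret of the recommended arm (best-arm identification).

```python
import random


class ToyEnvironment:
    """Hypothetical K-armed environment with batched feedback (not the AExGym API)."""

    def __init__(self, means, batch_size, noise_sd=1.0, seed=0):
        self.means = list(means)          # synthetic, stationary arm means (assumption)
        self.batch_size = batch_size
        self.noise_sd = noise_sd
        self.rng = random.Random(seed)

    def step(self, allocation):
        """Assign one batch using `allocation` (probabilities over arms).

        Returns (arm, reward) pairs, observed only once the whole batch is done.
        """
        arms = self.rng.choices(range(len(self.means)), weights=allocation,
                                k=self.batch_size)
        return [(a, self.rng.gauss(self.means[a], self.noise_sd)) for a in arms]


class UniformAgent:
    """Static baseline: equal allocation in every batch."""

    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.sums = [0.0] * n_arms
        self.counts = [0] * n_arms

    def allocate(self):
        return [1.0 / self.n_arms] * self.n_arms

    def update(self, batch):
        for arm, reward in batch:
            self.sums[arm] += reward
            self.counts[arm] += 1

    def recommend(self):
        """Post-experiment choice: arm with the highest sample mean."""
        est = [s / c if c else float("-inf")
               for s, c in zip(self.sums, self.counts)]
        return max(range(self.n_arms), key=lambda a: est[a])


class GreedyAgent(UniformAgent):
    """Illustrative adaptive policy: epsilon-greedy over batch-level estimates."""

    def __init__(self, n_arms, epsilon=0.2):
        super().__init__(n_arms)
        self.epsilon = epsilon

    def allocate(self):
        if not any(self.counts):          # no data yet: fall back to uniform
            return super().allocate()
        best = self.recommend()
        base = self.epsilon / self.n_arms
        return [base + (1 - self.epsilon) * (a == best)
                for a in range(self.n_arms)]


def run_experiment(env, agent, n_batches):
    """Return (cumulative regret during the experiment, simple regret after it)."""
    best_mean = max(env.means)
    cumulative_regret = 0.0
    for _ in range(n_batches):
        batch = env.step(agent.allocate())
        agent.update(batch)               # feedback arrives once per batch
        cumulative_regret += sum(best_mean - env.means[a] for a, _ in batch)
    simple_regret = best_mean - env.means[agent.recommend()]
    return cumulative_regret, simple_regret
```

Running `run_experiment` with a `UniformAgent` and a `GreedyAgent` on environments built from the same means and seed gives a side-by-side comparison of a static and an adaptive design on both metrics. Non-stationarity, constraints, and multiple objectives are deliberately left out of this sketch; each would be a change to `ToyEnvironment.step` or to the evaluation in `run_experiment`.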
Abstract
Innovations across science and industry are evaluated using randomized trials (a.k.a. A/B tests). While simple and robust, such static designs are inefficient or infeasible for testing many hypotheses. Adaptive designs can greatly improve statistical power in theory, but they have seen limited adoption due to their fragility in practice. We present a benchmark for adaptive experimentation based on real-world datasets, highlighting prominent practical challenges to operationalizing adaptivity: non-stationarity, batched/delayed feedback, multiple outcomes and objectives, and external validity. Our benchmark aims to spur methodological development that treats practical performance (e.g., robustness) as a central concern, rather than mathematical guarantees on contrived instances. We release an open-source library, AExGym, which is designed with modularity and extensibility in mind to allow experimentation practitioners to develop custom environments and algorithms.
