AExGym: Benchmarks and Environments for Adaptive Experimentation

Jimmy Wang, Ethan Che, Daniel R. Jiang, Hongseok Namkoong

TL;DR

This paper introduces AExGym, an open-source benchmark and environment suite for adaptive experimentation in A/B testing settings. It emphasizes practical challenges such as non-stationarity, batched feedback, multiple objectives, constraints, and external validity, and provides real-world datasets to benchmark adaptive policies beyond idealized theory. The framework models adaptive experiments as MDPs with an Environment, an Agent, and flexible evaluation criteria, enabling both in-experiment and post-experiment assessments including best-arm identification and personalization. Empirical results across Meager, NHIS, ASOS, and field datasets reveal that static baselines can outperform adaptive methods under operational constraints, underscoring the need for robust, constraint-aware policies. The work aims to drive inductive, data-driven development of adaptive strategies that perform well in real-world deployment.

Abstract

Innovations across science and industry are evaluated using randomized trials (a.k.a. A/B tests). While simple and robust, such static designs are inefficient or infeasible for testing many hypotheses. Adaptive designs can greatly improve statistical power in theory, but they have seen limited adoption due to their fragility in practice. We present a benchmark for adaptive experimentation based on real-world datasets, highlighting prominent practical challenges to operationalizing adaptivity: non-stationarity, batched/delayed feedback, multiple outcomes and objectives, and external validity. Our benchmark aims to spur methodological development that puts practical performance (e.g., robustness) as a central concern, rather than mathematical guarantees on contrived instances. We release an open source library, AExGym, which is designed with modularity and extensibility in mind to allow experimentation practitioners to develop custom environments and algorithms.

Paper Structure (12 sections, 8 equations, 9 figures, 3 tables)

This paper contains 12 sections, 8 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: The design of $\mathsf{AExGym}$ includes an $\mathsf{Environment}$, $\mathsf{Agent}$, and a set of criteria for evaluation. In each period, the $\mathsf{Environment}$ generates a batch of contexts (e.g., user features). The $\mathsf{Agent}$ receives these contexts, along with the entire history of the experimentation process (past context batches, assignments, and individual-level outcomes), and outputs a possibly personalized assignment policy. At the end of the experiment, the history of the experimentation process together with a final assignment policy from the $\mathsf{Agent}$ are evaluated by a set of practical evaluation criteria that may include one or more of the following: within-experiment costs, post-experiment objectives, budget constraints, or outcome constraints.
  • Figure 2: Illustration of performance degradation under constraints in a site selection task constructed from the Meager (2019) multi-site study. (Left) When each arm (site) can be sampled multiple times, Thompson Sampling-based algorithms perform relatively well. (Right) When arms can only be sampled once (a common practical constraint), all algorithms perform worse than Uniform.
  • Figure 3: Performance of various algorithms across 241 settings in the ASOS.com dataset. (Right) Plot of best-arm identification performance for algorithms that rely on contextual and non-contextual models. The methods tested universally perform worse than Uniform due to the challenging but naturally occurring non-stationarity within the environment ($n_{t} = 100{,}000$). (Left) Illustration of performance losses due to batching. Contextual Thompson Sampling that updates after every sample vastly outperforms Uniform, despite under-performing in the batched setting ($n_{t} = 10{,}000$).
  • Figure 4: Outcomes for a site selection task for the NHIS data. All algorithms generally do better than the Uniform allocation.
  • Figure 5: Raw regret values for the ASOS data with batch sizes of $10{,}000$, $100{,}000$, and $250{,}000$. Regret values decrease as batch sizes get larger.
  • ...and 4 more figures
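
The interaction loop described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the AExGym API: the class and function names (`BernoulliEnvironment`, `UniformAgent`, `BatchedThompsonAgent`, `run_experiment`, `evaluate`) are hypothetical, the outcomes are synthetic Bernoulli draws rather than one of the benchmark datasets, and the agents shown are non-contextual, so they ignore the contexts they receive.

```python
import numpy as np


class BernoulliEnvironment:
    """Hypothetical environment: emits context batches and binary outcomes."""

    def __init__(self, arm_means, batch_size, n_batches, seed=0):
        self.arm_means = np.asarray(arm_means, dtype=float)
        self.batch_size = batch_size
        self.n_batches = n_batches
        self.rng = np.random.default_rng(seed)

    def sample_contexts(self):
        # Placeholder user features; the agents below do not use them.
        return self.rng.normal(size=(self.batch_size, 2))

    def sample_outcomes(self, arms):
        # One Bernoulli outcome per unit, given its assigned arm.
        return self.rng.binomial(1, self.arm_means[arms])


class UniformAgent:
    """Static baseline: equal assignment probability for every arm."""

    def __init__(self, n_arms):
        self.n_arms = n_arms

    def policy(self, contexts, history):
        return np.full(self.n_arms, 1.0 / self.n_arms)


class BatchedThompsonAgent:
    """Beta-Bernoulli Thompson Sampling, updated once per batch."""

    def __init__(self, n_arms, seed=0, n_draws=2000):
        self.n_arms = n_arms
        self.n_draws = n_draws
        self.rng = np.random.default_rng(seed)

    def policy(self, contexts, history):
        successes = np.ones(self.n_arms)  # Beta(1, 1) prior
        failures = np.ones(self.n_arms)
        for _, arms, outcomes in history:
            successes += np.bincount(arms, weights=outcomes, minlength=self.n_arms)
            failures += np.bincount(arms, weights=1 - outcomes, minlength=self.n_arms)
        # Monte Carlo estimate of the probability that each arm is best.
        draws = self.rng.beta(successes, failures, size=(self.n_draws, self.n_arms))
        best = np.bincount(draws.argmax(axis=1), minlength=self.n_arms)
        return best / best.sum()


def run_experiment(env, agent, seed=0):
    """Run all batches; the agent sees the full history before each batch."""
    rng = np.random.default_rng(seed)
    n_arms = len(env.arm_means)
    history = []
    for _ in range(env.n_batches):
        contexts = env.sample_contexts()
        probs = agent.policy(contexts, history)
        arms = rng.choice(n_arms, size=env.batch_size, p=probs)
        outcomes = env.sample_outcomes(arms)
        history.append((contexts, arms, outcomes))
    return history


def evaluate(env, history):
    """Within-experiment regret plus simple regret of the empirical best arm."""
    n_arms = len(env.arm_means)
    arms = np.concatenate([a for _, a, _ in history])
    outcomes = np.concatenate([y for _, _, y in history])
    counts = np.bincount(arms, minlength=n_arms)
    sums = np.bincount(arms, weights=outcomes, minlength=n_arms)
    chosen = int(np.argmax(sums / np.maximum(counts, 1)))
    best = env.arm_means.max()
    return {
        "in_experiment_regret": float(np.sum(best - env.arm_means[arms])),
        "simple_regret": float(best - env.arm_means[chosen]),
    }
```

Calling `evaluate(env, run_experiment(env, agent))` for each agent gives one within-experiment cost and one post-experiment (best-arm identification) metric, mirroring the split between the two kinds of criteria in the caption. Budget and outcome constraints are left out of this sketch.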

Theorems & Definitions (3)

  • Example 1
  • Example 2: Linear TS
  • Example 3: Linear UCB
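
For readers unfamiliar with the two algorithms named in Examples 2 and 3, here is a sketch of their textbook forms, built on a shared Bayesian linear regression posterior. It is an assumption-laden illustration and may differ from the paper's exact formulation: the Gaussian prior, the known noise variance, the fixed per-arm feature vectors, and all names (`LinearPosterior`, `linear_ts_choice`, `linear_ucb_choice`, `run_batched_linear_ts`) are hypothetical choices made for this example.

```python
import numpy as np


class LinearPosterior:
    """Gaussian posterior over theta for the model y = x @ theta + noise."""

    def __init__(self, dim, prior_precision=1.0, noise_var=1.0):
        self.A = prior_precision * np.eye(dim)  # posterior precision matrix
        self.b = np.zeros(dim)                  # precision-weighted mean
        self.noise_var = noise_var

    def update(self, X, y):
        X = np.atleast_2d(np.asarray(X, dtype=float))
        y = np.asarray(y, dtype=float)
        self.A += X.T @ X / self.noise_var
        self.b += X.T @ y / self.noise_var

    @property
    def mean(self):
        return np.linalg.solve(self.A, self.b)

    @property
    def cov(self):
        cov = np.linalg.inv(self.A)
        return (cov + cov.T) / 2.0  # symmetrize against round-off


def linear_ts_choice(post, arm_features, rng):
    """Linear TS: sample theta from the posterior, then act greedily."""
    theta = rng.multivariate_normal(post.mean, post.cov)
    return int(np.argmax(arm_features @ theta))


def linear_ucb_choice(post, arm_features, alpha=1.0):
    """Linear UCB: posterior mean plus alpha times the posterior std. dev."""
    mean_reward = arm_features @ post.mean
    width = np.sqrt(np.einsum("ij,jk,ik->i", arm_features, post.cov, arm_features))
    return int(np.argmax(mean_reward + alpha * width))


def run_batched_linear_ts(n_batches=20, batch_size=50, seed=0):
    """Batched Linear TS on synthetic data; posterior updates once per batch."""
    rng = np.random.default_rng(seed)
    theta_true = np.array([1.0, -1.0])              # synthetic ground truth
    arm_features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    noise_sd = 0.5
    post = LinearPosterior(dim=2, noise_var=noise_sd ** 2)
    for _ in range(n_batches):
        # The posterior is frozen within a batch; each unit gets a fresh draw.
        arms = np.array([linear_ts_choice(post, arm_features, rng)
                         for _ in range(batch_size)])
        X = arm_features[arms]
        y = X @ theta_true + noise_sd * rng.normal(size=batch_size)
        post.update(X, y)
    return post, arm_features
```

Setting `alpha=0.0` in `linear_ucb_choice` reduces it to a greedy choice on the posterior mean, which is one simple way to read off a final recommendation after the last batch.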