Optimization-Driven Adaptive Experimentation

Ethan Che; Daniel R. Jiang; Hongseok Namkoong; Jimmy Wang

TL;DR

This work formulates a dynamic program based on central limit approximations, which enables scalable optimization methods based on auto-differentiation and GPU parallelization, and presents a mathematical programming formulation that can flexibly incorporate a wide range of objectives, constraints, and statistical procedures.

Abstract

Real-world experiments involve batched & delayed feedback, non-stationarity, multiple objectives & constraints, and (often some) personalization. Tailoring adaptive methods to address these challenges on a per-problem basis is infeasible, and static designs remain the de facto standard. Focusing on short-horizon ($\le 10$) adaptive experiments, we move away from bespoke algorithms and present a mathematical programming formulation that can flexibly incorporate a wide range of objectives, constraints, and statistical procedures. We formulate a dynamic program based on central limit approximations, which enables the use of scalable optimization methods based on auto-differentiation and GPU parallelization. To evaluate our framework, we implement a simple heuristic planning method ("solver") and benchmark it across hundreds of problem instances involving non-stationarity, personalization, and multiple objectives & constraints. Unlike bespoke methods (e.g., Thompson sampling variants), our mathematical programming framework provides consistent gains over static randomized controlled trials and exhibits robust performance across problem instances.

Paper Structure (38 sections, 9 theorems, 109 equations, 7 figures, 3 tables, 1 algorithm)

This paper contains 38 sections, 9 theorems, 109 equations, 7 figures, 3 tables, 1 algorithm.

Introduction
Batched Adaptive Experimentation
Reward Models
Challenges of Optimization for Adaptive Experimentation
Batch Limit Dynamic Program
Large-Batch Statistical Approximations
A Bayesian MDP Formulation
Objectives and Constraints
Optimization Algorithm ($\mathsf{RHO}$)
Robustness Guarantee
Numerical Experiments
Non-stationarity via the ASOS Digital Experiments Dataset
Results and Discussion for the ASOS Dataset
Personalized Value Models for Ranking and Recommendations
Results and Discussion for Personalized Value Models
...and 23 more sections

Key Result

Lemma 1

Under a Gaussian prior $\theta^\star \sim N(\beta_0, \Sigma_0)$, given observations $\{G_{t} \}_{t=1}^{T}$ where $G_{t} \,|\, G_{1:t-1} \sim N(\theta^\star, n_{t}^{-1}H_{t}^{-1} I_{t} H_{t}^{-1})$, the joint distribution of posterior states $\{(\beta_{t}, \Sigma_{t}) \}_{t=0}^{T}$ is characterized by […], where $H_{t}$ and $I_{t}$ are defined in eqn:hessian and eqn:grad_cov respectively and $Z_0, \ldots
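To make the setting of Lemma 1 concrete, here is a minimal sketch of the standard conjugate Gaussian update. It is restricted to an assumed special case: independent arms, so that the observation covariance $n_{t}^{-1}H_{t}^{-1} I_{t} H_{t}^{-1}$ is diagonal with entries $s_a^2 / (n_t p_a)$. The function names, the noise variances `noise_var`, and the allocation `p` are illustrative assumptions, not the paper's notation or code. The function `sample_next_state` uses the textbook identity that the posterior mean moves by a Gaussian step whose variance equals the reduction in posterior variance; it is not claimed to reproduce the lemma's exact display.

```python
import math
import random


def batch_obs_var(noise_var, n_t, p):
    """Variance of each arm's batch estimate: s_a^2 / (n_t * p_a).

    Assumes every sampling probability p_a is strictly positive.
    """
    return [s2 / (n_t * pa) for s2, pa in zip(noise_var, p)]


def posterior_update(beta, var, g, obs_var):
    """Conjugate Gaussian update, arm by arm (diagonal-covariance assumption)."""
    new_beta, new_var = [], []
    for b, v, y, s in zip(beta, var, g, obs_var):
        v_post = 1.0 / (1.0 / v + 1.0 / s)          # precisions add
        new_var.append(v_post)
        new_beta.append(v_post * (b / v + y / s))   # precision-weighted mean
    return new_beta, new_var


def sample_next_state(beta, var, obs_var, rng):
    """Draw the next posterior state without simulating the observation G_t.

    Marginally over G_t, the new mean is beta + sqrt(var - var_post) * Z
    with Z standard normal, while the new variance is deterministic.
    """
    new_beta, new_var = [], []
    for b, v, s in zip(beta, var, obs_var):
        v_post = 1.0 / (1.0 / v + 1.0 / s)
        new_var.append(v_post)
        new_beta.append(b + math.sqrt(v - v_post) * rng.gauss(0.0, 1.0))
    return new_beta, new_var


# Example: two arms, batch of 100 units split evenly.
obs_var = batch_obs_var([1.0, 1.0], 100, [0.5, 0.5])
beta1, var1 = posterior_update([0.0, 0.0], [1.0, 1.0], [0.1, 0.3], obs_var)
beta_sim, var_sim = sample_next_state([0.0, 0.0], [1.0, 1.0], obs_var, random.Random(0))
```

Because the variance update does not depend on the observed data, the posterior variance trajectory is fully determined by the sampling allocations; only the posterior mean is random.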

Figures (7)

Figure 1: (Real-world non-stationarity) Benchmark results on 241 non-stationary settings based on 78 real experiments run at ASOS, a fashion retailer with over 26 million active customers as of 2024 liu2021datasets. Treatment effects vary significantly over days. We simulate 750 batched experiments across instances, resulting in 180,750 evaluations across different batch sizes and different policies. Plots are shown for batch size $n_{t}=100$ and $T = 10$. Contextual policies model non-stationarity effects, but still pick a single arm at the end of the experiment; we compare $\mathsf{BLDP}$ with variants of Thompson sampling, including Contextual Top-Two Thompson sampling (TTTS) QinRusso2023 tailored for non-stationary settings. Right: Adaptive algorithms often do worse than uniform, overfitting on initial, temporary performance. $\mathsf{RHO}$ (stars) is the only adaptive policy that achieves greater average reward and chooses the best arm more often than uniform sampling, outperforming bandit algorithms even under model misspecification (non-contextual), across a wide range of learning rates. Left: Quantile of simple regret across 241 settings (normalized by that of uniform allocation). $\mathsf{RHO}$ outperforms uniform on more instances (60.5%) compared to other adaptive policies (51.8% for Top-Two Thompson Sampling). TS-based policies tend to be more fragile on difficult instances; when they underperform Uniform, they do so by 10.7% on average (compared to 6.9% for $\mathsf{RHO}$).
Figure 2: Satisfying budget constraints while optimizing simple plus cumulative regret. Pareto frontier of the tradeoff between simple and cumulative regret while maintaining a fixed budget of $100$ units across the experiment. Each arm is associated with a fixed cost. $\mathsf{RHO}$ can exactly satisfy the budget constraint in expectation via projected gradient descent. Additionally, $\mathsf{RHO}$ trades off between simple and cumulative regret in a principled manner by setting weights equal to the number of individuals under each objective. We are not aware of any existing adaptive algorithm designed for this setting, so we combine Budgeted TS xia2015thompson, designed for budget constraints, and Top-Two TS Russo20, which can trade off between simple and cumulative regret qin2024optimize. In contrast to $\mathsf{RHO}$, it is difficult to tune Budgeted Top-Two TS in an exact and principled way to satisfy the objective and constraints. This method involves a scaling parameter $b\in [0, \infty)$ to satisfy the budget constraint and a scaling parameter $\alpha \in [0,1]$ to trade off between simple and cumulative regret. These parameters are difficult to tune because adjusting one directly affects the other. Keeping the budget parameter $b$ constant while varying $\alpha$ to trade off between simple and cumulative regret yields average budget costs from 82.2 to 105.2 units, potentially violating the constraint. In contrast, keeping $\mathsf{RHO}$'s budget constraint fixed yields average budget costs varying from 94.1 to 99.5 units.
Figure 3: $\mathsf{BLDP}$ provides a differentiable MDP that can be minimized via stochastic gradient descent. For a 3-armed experiment with horizon $T = 10$, we plot simple regret (red) over static sampling allocations $\pi(\mathcal{H}):= p = (p_{1}, p_{2}, 1-p_{1}-p_{2})$ optimized using the Adam optimizer.
Figure 4: (Personalized Value Models) Pareto frontier of the tradeoff between simple and cumulative regret with $5$ epochs of experimentation in a synthetic personalized ranking simulation. Lighter colors corresponding to TTTS indicate parameter values closer to $0$, while darker values correspond to values closer to $1$. Note that a value of $1$ is exactly TS. Similarly, lighter values for $\mathsf{RHO}$ correspond to higher weights on simple regret while darker values correspond to higher weights on cumulative regret. $\mathsf{RHO}$ is able to efficiently trade off between the two objectives, with many points strictly dominating all values associated with TTTS. As more weight is put on the simple regret term, the simple regret incurred decreases monotonically. $\mathsf{RHO}$ trades off in a principled and interpretable way simply by setting the weights equal to the number of individuals under each objective. On the other hand, there is no principled way of selecting the parameter values for TTTS, and most do not outperform Uniform in terms of simple regret.
Figure 5: Plot of simple and cumulative regret performance of various policies and $\mathsf{RHO}$ when it optimizes only for simple regret. (Left) $\mathsf{RHO}$ is the only policy that consistently outperforms Uniform. (Right) Plotting the cumulative regret performance shows that both TS and TTTS incur lower cumulative regret than $\mathsf{RHO}$, potentially showing that they are too greedy, giving a possible explanation as to why they fail compared to $\mathsf{RHO}$ and Uniform.
...and 2 more figures
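
The objective surface described in Figure 3 can be illustrated with a small sketch. It assumes independent Gaussian arms with known noise variances (the names `beta0`, `var0`, and `noise_var` are hypothetical) and evaluates a static allocation $p$ by Monte Carlo. Under a Bayesian model, minimizing simple regret over $p$ is equivalent to maximizing the expected largest end-of-experiment posterior mean, since the expected value of the best arm does not depend on $p$. The caption reports optimization with Adam; to stay dependency-free, this sketch swaps in a coarse grid search over the 3-arm simplex, so it is not the paper's solver.

```python
import math
import random


def expected_best_posterior_mean(p, beta0, var0, noise_var, n, T, draws):
    """Monte Carlo estimate of E[max_a beta_{T,a}] under a static allocation p.

    Arm a receives T * n * p_a samples in total, so its final posterior
    variance is deterministic and its final posterior mean is Gaussian
    around beta0 with variance (var0 - var_T).
    """
    var_T = [1.0 / (1.0 / v + T * n * pa / s2)
             for v, pa, s2 in zip(var0, p, noise_var)]
    scale = [math.sqrt(max(v - vT, 0.0)) for v, vT in zip(var0, var_T)]
    total = 0.0
    for z in draws:
        total += max(b + sc * za for b, sc, za in zip(beta0, scale, z))
    return total / len(draws)


def grid_search(beta0, var0, noise_var, n, T, steps=10, n_draws=2000, seed=0):
    """Search allocations p = (p1, p2, 1 - p1 - p2) on a grid over the simplex."""
    rng = random.Random(seed)
    # Common random numbers: reuse the same draws for every candidate p,
    # so that comparisons between allocations are not dominated by noise.
    draws = [[rng.gauss(0.0, 1.0) for _ in beta0] for _ in range(n_draws)]
    best_p, best_val = None, -math.inf
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            p = (i / steps, j / steps, (steps - i - j) / steps)
            val = expected_best_posterior_mean(
                p, beta0, var0, noise_var, n, T, draws)
            if val > best_val:
                best_p, best_val = p, val
    return best_p, best_val


# Example: three arms with identical priors but different noise levels.
best_p, best_val = grid_search(
    beta0=[0.0, 0.0, 0.0], var0=[1.0, 1.0, 1.0],
    noise_var=[1.0, 4.0, 9.0], n=100, T=10)
```

Because the objective is a smooth function of $p$ once the random draws are fixed, the same estimate could in principle be differentiated and optimized with a gradient method, which is the approach the caption describes.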

Theorems & Definitions (14)

Definition 1
Lemma 1
Definition 2
Theorem 1: Policy Improvement
Definition 3
Theorem 2
Definition 4
Definition 5
Corollary 1
Corollary 2
...and 4 more
