Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits

Siddhartha Banerjee, Sean R. Sinclair, Milind Tambe, Lily Xu, Christina Lee Yu

TL;DR

Artificial Replay is a meta-algorithm that incorporates historical data into any base bandit algorithm, warm-starting online learning without the data inefficiency of using every historical sample upfront. The authors define a property called independence of irrelevant data (IIData) and, for base algorithms that satisfy it, use a sample-path coupling to show that Artificial Replay achieves regret identical to a full warm start while using only a fraction of the historical data. The framework applies to finite, metric, and combinatorial bandits, and extends to continuous action spaces via adaptive discretization. Theoretical results establish regret equivalence under IIData and quantify data-efficiency gains, with empirical validation on K-armed and CMAB-CRA settings using real poaching data in green-security domains. Even base algorithms that do not satisfy IIData, such as Thompson sampling, improve empirically under Artificial Replay. The work reports reductions in upfront computation and storage while matching the regret of a full warm start, and provides a unified approach to exploiting historical data across bandit models and applications.

Abstract

Most real-world deployments of bandit algorithms exist somewhere in between the offline and online set-up, where some historical data is available upfront and additional data is collected dynamically online. How best to incorporate historical data to "warm start" bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to data inefficiency (amount of historical data used) - particularly for continuous action spaces. To address these challenges, we propose ArtificialReplay, a meta-algorithm for incorporating historical data into any arbitrary base bandit algorithm. We show that ArtificialReplay uses only a fraction of the historical data compared to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data (IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on K-armed bandits and continuous combinatorial bandits, on which we model green security domains using real poaching data. Our results show the practical benefits of ArtificialReplay for improving data efficiency, including for base algorithms that do not satisfy IIData.
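
The replay idea described above can be sketched for a $K$-armed bandit as follows. This is a minimal illustration, not the authors' implementation: the base algorithm here is a standard UCB1-style index (the paper's MonUCB is not reproduced), rewards are assumed to be Bernoulli, and the names `ucb_choose` and `artificial_replay` are hypothetical. Whenever the base algorithm proposes an arm that still has an unused historical sample, that sample is fed to the algorithm in place of a real pull; an online pull happens only when no such sample remains.

```python
import math
import random
from collections import defaultdict


def ucb_choose(counts, sums):
    """UCB1-style arm choice (an assumed base algorithm, not the paper's MonUCB)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # try every arm at least once
    total = sum(counts)

    def index(arm):
        mean = sums[arm] / counts[arm]
        bonus = math.sqrt(2.0 * math.log(total + 1) / counts[arm])
        return mean + bonus

    return max(range(len(counts)), key=index)


def artificial_replay(means, history, horizon, seed=0):
    """Run the replay loop on a Bernoulli bandit with the given arm means.

    history is a list of (arm, reward) pairs collected offline.
    Returns the list of arms pulled online and the number of
    historical samples that were actually consumed.
    """
    rng = random.Random(seed)
    unused = defaultdict(list)
    for arm, reward in history:
        unused[arm].append(reward)

    counts = [0] * len(means)
    sums = [0.0] * len(means)
    used = 0
    pulls = []

    for _ in range(horizon):
        # Replay historical samples until the proposed arm has none left.
        while True:
            arm = ucb_choose(counts, sums)
            if not unused[arm]:
                break
            counts[arm] += 1
            sums[arm] += unused[arm].pop()
            used += 1
        # No unused history for this arm: take a real online pull.
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        pulls.append(arm)

    return pulls, used


if __name__ == "__main__":
    # Hypothetical example: history concentrated on the worse arm.
    hist = [(0, 0.0)] * 200 + [(1, 1.0)] * 20
    pulls, used = artificial_replay([0.2, 0.8], hist, horizon=100)
    print(f"used {used} of {len(hist)} historical samples")
```

Historical samples on arms the base algorithm never proposes are never read, which is the source of the data-efficiency gain; a full warm start would instead load all of `history` into `counts` and `sums` before the first online step.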

Paper Structure (66 sections, 12 theorems, 54 equations, 9 figures, 1 table, 5 algorithms)

Key Result

Theorem 1

For the MonUCB base algorithm, with suboptimality gap $\Delta(a) = \max_{a^\prime} \mu(a^\prime) - \mu(a)$:

Figures (9)

  • Figure 1: ($K$-armed) Increasing the number of historical samples $H$ leads Full Start to use unnecessary data, particularly as $H$ gets very large. Artificial Replay achieves equal performance in terms of regret (plot a) while using less than half the historical data (plot b). In plot (c) we see that with $H=1{,}000$ historical samples, Artificial Replay uses (on average) 117 historical samples before taking its first online action. The number of historical samples used increases at a decreasing rate, using only 396 of $1{,}000$ total samples by the horizon $T$. Results are shown on the $K$-armed bandit setting with $K=10$ and horizon $T=1{,}000$.
  • Figure 2: (CMAB-CRA) Holding $H=10{,}000$ constant, we increase the fraction of historical data samples on bad arms (bottom 20% of rewards). The plots show (a) regret, (b) $\%$ of unused historical data, and (c) number of discretized regions in partition $\mathcal{P}$. Artificial Replay enables significantly improved runtime and reduced storage while matching the performance of Full Start. Results on the CMAB-CRA setting with adaptive discretization on the quadratic domain.
  • Figure 3: (CMAB-CRA) Cumulative regret ($y$-axis; lower is better) across time $t \in [T]$. Artificial Replay performs as well as Full Start across all domain settings, including both fixed discretization (top row) and adaptive discretization (bottom row). Regressor performs quite poorly.
  • Figure 4: Comparison of a fixed (middle) and adaptive (right) discretization on a two-dimensional resource set $\mathcal{S}$ for a fixed allocation level $\beta$. The underlying color gradient corresponds to the mean reward $\mu(\mathbf{p}, \beta)$ with red corresponding to higher value and blue to lower value (see figure on left for legend). The fixed discretization algorithm is forced to explore uniformly across the entire resource space. In contrast, the adaptive discretization algorithm is able to maintain a data efficient representation, even without knowing the underlying mean reward function a priori.
  • Figure 5: Reward function for the quadratic environment.
  • ...and 4 more figures

Theorems & Definitions (25)

  • Theorem 1
  • Definition 1: Ignorant algorithm
  • Definition 2: Full warm start
  • Definition 3
  • Definition 4: Independence of irrelevant data
  • Theorem 2: Regret Coupling of Artificial Replay to Full Start
  • Theorem 3: Regret Improvement of Artificial Replay
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • ...and 15 more