Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits
Siddhartha Banerjee, Sean R. Sinclair, Milind Tambe, Lily Xu, Christina Lee Yu
TL;DR
Artificial Replay is a general meta-algorithm for incorporating historical data into any base bandit algorithm, so that online learning can be warm-started without the data inefficiency of consuming every historical sample upfront. The authors define a property called independence of irrelevant data (IIData) and, for base algorithms that satisfy it, use a sample-path coupling to show that Artificial Replay achieves regret identical to a full warm start while using only a fraction of the historical data. The framework applies to finite, metric, and combinatorial bandits, and extends to continuous action spaces via adaptive discretization. The theoretical results establish regret equivalence under IIData and quantify the data-efficiency gains; experiments cover K-armed bandits and CMAB-CRA (continuous combinatorial bandit) settings that model green security domains using real poaching data. Base algorithms that do not satisfy IIData, such as Thompson sampling, also improve empirically under Artificial Replay, which suggests the approach is robust in practice. Overall, the work reports significant reductions in upfront computation and storage while matching the regret of a full warm start, and offers a unified way to exploit historical data across diverse bandit models and applications.
Abstract
Most real-world deployments of bandit algorithms fall somewhere between the offline and online settings: some historical data is available upfront, and additional data is collected dynamically online. How best to incorporate historical data to "warm start" bandit algorithms is an open question. Naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to data inefficiency (measured by the amount of historical data used), particularly for continuous action spaces. To address these challenges, we propose ArtificialReplay, a meta-algorithm for incorporating historical data into an arbitrary base bandit algorithm. We show that ArtificialReplay uses only a fraction of the historical data compared to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data (IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on K-armed bandits and continuous combinatorial bandits, on which we model green security domains using real poaching data. Our results show the practical benefits of ArtificialReplay for improving data efficiency, including for base algorithms that do not satisfy IIData.
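The abstract does not spell out the replay mechanism, so the following is only a minimal sketch under a stated assumption: whenever the base algorithm proposes an action that still has unused historical samples, one such sample is fed back to the base algorithm instead of acting online; otherwise the action is played for real. The names `UCB`, `artificial_replay`, `pull`, and `history` are hypothetical, and the UCB1-style rule is swapped in purely for illustration, with no claim that it satisfies IIData.

```python
import math
from collections import defaultdict


class UCB:
    """Minimal UCB1-style base algorithm (illustrative choice only)."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.sums = [0.0] * n_arms

    def select(self):
        # Try every arm once before relying on confidence bounds.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        t = sum(self.counts)
        return max(
            range(len(self.counts)),
            key=lambda a: self.sums[a] / self.counts[a]
            + math.sqrt(2 * math.log(t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward


def artificial_replay(base, history, pull, horizon):
    """Assumed replay loop: history is a list of (arm, reward) pairs,
    pull is a callable mapping an arm to an online reward."""
    unused = defaultdict(list)
    for arm, reward in history:
        unused[arm].append(reward)

    used = 0      # historical samples actually consumed
    online = []   # (arm, reward) pairs observed online
    while len(online) < horizon:
        arm = base.select()
        if unused[arm]:
            # Replay a stored sample; no online step is spent.
            base.update(arm, unused[arm].pop())
            used += 1
        else:
            reward = pull(arm)
            base.update(arm, reward)
            online.append((arm, reward))
    return online, used


# Imbalanced history: 500 samples of a poor arm, 5 of a good one.
# Rewards are deterministic so the run is reproducible.
means = [0.1, 0.9]
history = [(0, means[0])] * 500 + [(1, means[1])] * 5
online, used = artificial_replay(UCB(2), history, lambda a: means[a], horizon=200)
print(f"used {used} of {len(history)} historical samples")
```

In this toy setup most of the 500 samples for the poor arm are never requested by the base algorithm, which illustrates the kind of data-efficiency gain the abstract describes; a full warm start would instead process all 505 samples before the first online step.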
