Online Bandits with (Biased) Offline Data: Adaptive Learning under Distribution Mismatch

Wang Chi Cheung, Lixing Lyu

TL;DR

This work studies online learning with offline data under distribution drift in stochastic multi-armed and combinatorial bandits. It proves an impossibility result showing that no non-anticipatory policy can beat vanilla UCB without a bound on offline-online drift, and introduces MIN-UCB and MIN-COMB-UCB, adaptive policies that leverage informative offline data via a valid bias bound V. The proposed methods yield tight instance-dependent and instance-independent regret bounds, quantified through a discrepancy measure omega(a) that captures how offline data align with online rewards. The framework is validated by numerical experiments and applied to dynamic pricing and social influence maximization, highlighting when offline data provide meaningful speedups and when they risk harm. Overall, the paper provides a principled, adaptive approach to integrating biased historical data into online learning with provable regret guarantees.

Abstract

Traditional online learning models are typically initialized from scratch. By contrast, contemporary real-world applications often have access to historical datasets that can potentially enhance the online learning process. We study how offline data can be leveraged to facilitate online learning in stochastic multi-armed bandits and combinatorial bandits. In our study, the probability distributions that govern the offline data and the online rewards can be different. We first show that, without a non-trivial upper bound on their difference, no non-anticipatory policy can outperform the classical Upper Confidence Bound (UCB) policy, even with access to offline data. As a complement, we propose an online policy, MIN-UCB, for multi-armed bandits. MIN-UCB outperforms UCB when such an upper bound is available. MIN-UCB adaptively chooses to utilize the offline data when they are deemed informative, and to ignore them otherwise. We establish that MIN-UCB achieves tight regret bounds in both the instance-independent and instance-dependent settings. We generalize our approach to the combinatorial bandit setting by introducing MIN-COMB-UCB, and we provide corresponding instance-dependent and instance-independent regret bounds. We illustrate how various factors, such as the biases and the sizes of the offline datasets, affect the utility of offline data in online learning. We discuss several applications and conduct numerical experiments to validate our findings.
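
The adaptive "use the offline data only when informative" idea can be illustrated with a small sketch. It assumes one plausible reading of the description above: each arm's index is the minimum of an online-only UCB and a pooled (offline plus online) UCB that is inflated by a bias term weighted by the bound `V`. The function name `min_ucb`, the Bernoulli rewards, and the confidence-radius constants are illustrative assumptions, not the paper's exact specification.

```python
import math
import random


def min_ucb(means, offline, V, T, seed=0):
    """Run a MIN-UCB-style policy on Bernoulli arms and return online pull counts.

    means   : true online mean reward per arm (hypothetical test instance)
    offline : per-arm lists of offline reward samples (possibly biased)
    V       : assumed valid upper bounds on |offline mean - online mean| per arm
    T       : online horizon
    """
    rng = random.Random(seed)
    K = len(means)
    n = [0] * K                           # online pull counts
    s = [0.0] * K                         # online reward sums
    ts = [len(o) for o in offline]        # offline sample sizes
    s_off = [sum(o) for o in offline]     # offline reward sums

    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                     # pull every arm once to initialize
        else:
            # Assumed confidence level; the paper's constants may differ.
            log_term = 2.0 * math.log(2.0 * K * t * t)
            index = []
            for i in range(K):
                # UCB built from online samples only.
                ucb_online = s[i] / n[i] + math.sqrt(log_term / n[i])
                # UCB built from pooled samples, inflated by the possible bias.
                m = n[i] + ts[i]
                ucb_pooled = ((s[i] + s_off[i]) / m
                              + math.sqrt(log_term / m)
                              + ts[i] / m * V[i])
                # Taking the minimum keeps whichever valid bound is tighter.
                index.append(min(ucb_online, ucb_pooled))
            a = max(range(K), key=index.__getitem__)
        reward = 1.0 if rng.random() < means[a] else 0.0
        n[a] += 1
        s[a] += reward
    return n


# Hypothetical instance: unbiased offline data, so V = 0 is a valid bound.
offline = [[1.0] * 40 + [0.0] * 160, [1.0] * 160 + [0.0] * 40]
print(min_ucb([0.2, 0.8], offline, [0.0, 0.0], T=500, seed=1))
```

In this sketch, a large `V[i]` pushes the pooled bound above the online-only bound, so the minimum falls back to plain UCB and the offline data for that arm are effectively ignored. A small `V[i]` with many offline samples makes the pooled bound tighter from the first round.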

Paper Structure (52 sections, 19 theorems, 143 equations, 2 figures, 2 algorithms)

This paper contains 52 sections, 19 theorems, 143 equations, 2 figures, 2 algorithms.

Key Result

Theorem 1

Let $T_\text{S}(1), T_\text{S}(2)$ be arbitrary. Consider an arbitrary non-anticipatory policy $\pi$ (which possesses only the offline dataset $S$, not the auxiliary input $V$) that satisfies $\mathbb{E}[\text{Reg}_T(\pi, P)]\leq C T^{\beta - \epsilon}\log T$ on instance $I_P$, where $\epsilon \in (0, \ldots$ The following inequality holds: $\mathbb{E}[\text{Reg}(\pi, Q)] \geq \ldots$

Figures (2)

  • Figure 1: Effect of Discrepancy: Both magnitude and direction of bias are important.
  • Figure 2: Effect of $T$ and $T_{\text{S}}$: informative offline data can significantly enhance online learning.

Theorems & Definitions (24)

  • Theorem 1
  • Lemma 1
  • Theorem 2
  • Definition 1
  • Corollary 1
  • Definition 2
  • Definition 3
  • Theorem 3
  • Lemma 2
  • Proof of Lemma \ref{lemma:crucial_bound}
  • ...and 14 more