Balancing optimism and pessimism in offline-to-online learning

Flore Sentenac, Ilbin Lee, Csaba Szepesvari

TL;DR

This work studies offline-to-online learning in stochastic multi-armed bandits, where the learner starts with $m_i$ offline samples of each arm $i$ and then interacts online over a horizon of $T$ rounds. It introduces OtO, a horizon-adaptive method that uses a budget to switch between UCB (optimistic) and LCB (pessimistic) strategies, performing nearly as well as the better of UCB and LCB at any point in time. Theoretical guarantees provide regret bounds against both the optimal policy and the logging policy across the offline-to-online spectrum, with an extension to unknown horizons via horizon doubling. Empirical results on synthetic and real ad-click data show that OtO consistently matches or approaches the better of UCB and LCB across horizons and offline-data configurations, suggesting that the approach may extend beyond multi-armed bandits to contextual bandits and reinforcement learning.

Abstract

We consider what we call the offline-to-online learning setting, focusing on stochastic finite-armed bandit problems. In offline-to-online learning, a learner starts with offline data collected from interactions with an unknown environment in a way that is not under the learner's control. Given this data, the learner begins interacting with the environment, gradually improving its initial strategy as it collects more data to maximize its total reward. The learner in this setting faces a fundamental dilemma: if the policy is deployed for only a short period, a suitable strategy (in a number of senses) is the Lower Confidence Bound (LCB) algorithm, which is based on pessimism. LCB can effectively compete with any policy that is sufficiently "covered" by the offline data. However, for longer time horizons, a preferred strategy is the Upper Confidence Bound (UCB) algorithm, which is based on optimism. Over time, UCB converges to the performance of the optimal policy at a rate that is nearly the best possible among all online algorithms. In offline-to-online learning, however, UCB initially explores excessively, leading to worse short-term performance compared to LCB. This suggests that a learner not in control of how long its policy will be in use should start with LCB for short horizons and gradually transition to a UCB-like strategy as more rounds are played. This article explores how and why this transition should occur. Our main result shows that our new algorithm performs nearly as well as the better of LCB and UCB at any point in time. The core idea behind our algorithm is broadly applicable, and we anticipate that our results will extend beyond the multi-armed bandit setting.
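The contrast between pessimism and optimism described above can be illustrated with a minimal sketch. It is not the paper's OtO algorithm, and it makes assumptions that do not come from the source: rewards are Bernoulli, the confidence width is a Hoeffding-style term with a fixed `delta`, and the names `confidence_width`, `choose_arm`, and `run` are hypothetical. Both rules start from the same offline counts; they differ only in whether the width is added (UCB) or subtracted (LCB).

```python
import math
import random


def confidence_width(n, delta):
    """Hoeffding-style width for rewards in [0, 1] (an assumed choice).

    Returns infinity when the arm has no samples at all.
    """
    if n == 0:
        return math.inf
    return math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def choose_arm(counts, sums, delta, optimistic):
    """Pick the arm with the largest index.

    optimistic=True  -> UCB index: empirical mean + width
    optimistic=False -> LCB index: empirical mean - width
    """
    sign = 1.0 if optimistic else -1.0
    best_arm, best_index = 0, -math.inf
    for arm, (n, s) in enumerate(zip(counts, sums)):
        mean = s / n if n > 0 else 0.0
        index = mean + sign * confidence_width(n, delta)
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm


def run(means, offline_counts, horizon, optimistic, delta=0.05, seed=0):
    """Simulate offline data collection followed by `horizon` online rounds.

    `means` are the (assumed Bernoulli) arm means and `offline_counts[i]`
    plays the role of m_i. Returns the total online reward.
    """
    rng = random.Random(seed)
    counts = list(offline_counts)
    # Draw the offline samples for each arm.
    sums = [
        sum(1 for _ in range(n) if rng.random() < mu)
        for mu, n in zip(means, offline_counts)
    ]
    total = 0
    for _ in range(horizon):
        arm = choose_arm(counts, sums, delta, optimistic)
        reward = 1 if rng.random() < means[arm] else 0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total
```

With counts `[100, 1]` and reward sums `[50, 1]`, the pessimistic rule selects the well-covered arm 0, while the optimistic rule selects the barely sampled arm 1. This is the short-horizon versus long-horizon tension the abstract describes.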

Paper Structure

This paper contains 21 sections, 8 theorems, 120 equations, 10 figures, 4 tables, 1 algorithm.

Key Result

Lemma 1

For any instance $\theta$, any algorithm $\mathcal{A}$, and any $T\geq0$ it holds that and

Figures (10)

  • Figure 1: Evolution of the two regret measures when offline samples are uniformly spread between the arms, i.e., $m_i=\frac{m}{K}$ for all $i \in [K]$. The best algorithms are the ones closest to $(0,0)$. From left to right, the horizon is $T=1$, $T=m$, and $T\gg m$. The OtO curve in the plot uses $\alpha=1$.
  • Figure 2: Evolution of the two regret measures when all offline samples are concentrated on two arms, i.e., $m_1=m_2=\frac{m}{2K}$. From left to right, the horizon is $T=1$ and $T=m$. For readability, we do not plot $T\gg m$: the relative behavior of the algorithms at that horizon would be similar to that at $T=m$, but with an even larger ratio between the horizontal and vertical axes.
  • Figure 3: Regret of the three algorithms for different instances and $T$ values, when the horizon $T$ is given to OtO
  • Figure 4: Regret of the three algorithms for $T=m=2000$, when the horizon $T$ is unknown.
  • Figure 5: Total cumulative reward of each algorithm in Setting 1
  • ...and 5 more figures

Theorems & Definitions (12)

  • Lemma 1
  • Remark 1
  • Theorem 1
  • Remark 2
  • Remark 3
  • Theorem 2: Lower bound on the minimax regret of any algorithm for offline-to-online learning
  • Theorem 3: UCB's upper bound on the minimax regret
  • Proposition 1: Minimax regret of LCB
  • Proposition 2: UCB's regret against the logging policy for $T=1$
  • Proposition 3: UCB's regret against the logging policy for general $T$
  • ...and 2 more