Offline-to-online hyperparameter transfer for stochastic bandits

Dravyansh Sharma; Arun Sai Suggala

TL;DR

This work addresses how to transfer hyperparameters for stochastic bandit algorithms from offline tasks to online deployment when tasks are drawn from an unknown distribution. It first proves an impossibility result for fully online hyperparameter tuning, then builds a formal transfer-learning framework that characterizes the inter-task and intra-task sample complexities via the derandomized dual complexity $Q_{D}$. The authors instantiate the framework to tune the exploration parameter in UCB and LinUCB and the noise parameter in GP-UCB, using empirical risk minimization over offline tasks and an efficient critical-points procedure to handle continuous hyperparameters. Experiments on synthetic and real data show substantial gains over corralling baselines, and $Q_{D}$ is typically small, which indicates that the transfer method is efficient in practice.

Abstract

Classic algorithms for stochastic bandits typically use hyperparameters that govern their critical properties, such as the trade-off between exploration and exploitation. Tuning these hyperparameters is of great practical significance, but it is challenging and, in certain cases, information-theoretically impossible. To address this challenge, we consider a practically relevant transfer learning setting where one has access to offline data collected from several bandit problems (tasks) drawn from an unknown distribution over tasks. Our aim is to use this offline data to set the hyperparameters for a new task drawn from the same unknown distribution. We provide bounds on the inter-task (number of tasks) and intra-task (number of arm pulls per task) sample complexity for learning near-optimal hyperparameters on unseen tasks drawn from the distribution. Our results apply to several classic algorithms, including tuning the exploration parameters in UCB and LinUCB and the noise parameter in GP-UCB. Our experiments indicate the significance and effectiveness of transferring hyperparameters from offline problems to online learning with stochastic bandit feedback.
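To make the setting concrete, the following is a minimal sketch of offline tuning of the UCB exploration parameter by empirical risk minimization over offline tasks. It is an illustration under stated assumptions, not the authors' procedure: it searches a fixed grid rather than using the critical-points procedure, it represents each offline task as a hypothetical table of pre-drawn rewards, and it uses empirical arm means as a proxy for the true means when estimating regret. The names `replay_ucb`, `sample_offline_task`, and `tune_alpha`, and the two-arm Gaussian task distribution, are all assumptions made for this example.

```python
import numpy as np


def replay_ucb(rewards, alpha):
    """Replay UCB(alpha) on one offline task and return its estimated regret.

    rewards: array of shape (num_arms, horizon) holding pre-drawn rewards
    (a hypothetical offline data format assumed for this sketch).
    """
    num_arms, horizon = rewards.shape
    mean_est = rewards.mean(axis=1)  # proxy for the unknown true arm means
    counts = np.zeros(num_arms, dtype=int)
    sums = np.zeros(num_arms)
    regret = 0.0
    for t in range(horizon):
        if t < num_arms:
            arm = t  # pull every arm once to initialize
        else:
            bonus = alpha * np.sqrt(np.log(t + 1) / counts)
            arm = int(np.argmax(sums / counts + bonus))
        sums[arm] += rewards[arm, counts[arm]]  # consume the next stored reward
        counts[arm] += 1
        regret += mean_est.max() - mean_est[arm]
    return regret


def sample_offline_task(rng, horizon):
    """Assumed task distribution: two Gaussian arms with a random gap."""
    gap = rng.uniform(0.1, 0.5)
    means = np.array([0.5 + gap / 2, 0.5 - gap / 2])
    return rng.normal(means[:, None], 1.0, size=(2, horizon))


def tune_alpha(tasks, grid):
    """Grid-search ERM: pick the alpha with the lowest average regret."""
    avg_regret = [np.mean([replay_ucb(task, a) for task in tasks]) for a in grid]
    return float(grid[int(np.argmin(avg_regret))]), avg_regret


rng = np.random.default_rng(0)
offline_tasks = [sample_offline_task(rng, horizon=200) for _ in range(20)]
alpha_grid = np.linspace(0.0, 4.0, 9)
best_alpha, avg_regret = tune_alpha(offline_tasks, alpha_grid)
print("selected alpha:", best_alpha)
```

The number of offline tasks (20 here) and the number of stored pulls per task (200 here) correspond to the inter-task and intra-task sample sizes discussed in the abstract. The selected value would then be used to run UCB on a new task from the same distribution.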

Paper Structure (30 sections, 18 theorems, 27 equations, 6 figures, 1 table, 5 algorithms)

Introduction
Related Work
Transfer Learning.
Corralling.
Model selection in Bandits.
Data-driven Hyper-parameter Selection.
Hyperparameter free Bandit Algorithms.
Distribution Shift.
Meta-learning Bayesian Bandits.
Preliminaries
Impossibility of Hyperparameter Tuning
Formal framework for transfer learning
Derandomized dual complexity
Sample complexity of bandit hyperparameter tuning
Tuning the exploration parameter
...and 15 more sections

Key Result

Theorem 4.1

Let $\Pi$ be the set of MAB problems with arm rewards sampled from Gaussian distributions with variance belonging to the set $[0, B^2]$. Let $\mathcal{A}$ be the set of UCB policies, with $\rho$ being the scale parameter multiplying the confidence width. Then for any meta algorithm $\widetilde{A}$,

Figures (6)

Figure 1: Variation of (estimated) expected regret with the exploration parameter $\alpha\in[0,4]$ for two-arm stochastic bandits for symmetric Gaussian distributions.
Figure 2: Comparison of Algorithm \ref{alg:tuned-ucb} to the corralling-based algorithms Corral \cite{agarwal2017corralling} and Corral-stochastic \cite{arora2021corralling}.
Figure 3: Comparison of Algorithm \ref{alg:tuned-ucb} to the corralling-based algorithms Corral \cite{agarwal2017corralling} and Corral-stochastic \cite{arora2021corralling} on 2-arm bandits with a large arm-reward gap.
Figure 4: Comparison of Algorithm \ref{alg:tuned-ucb} with corralling baselines on synthetic 2-arm bandit datasets with uniform and Gaussian arm reward distributions.
Figure 5: Variation of regret on test (online) tasks with number of training tasks $N$ for tuning UCB$(\alpha)$.
...and 1 more figures

Theorems & Definitions (27)

Theorem 4.1
Definition 1
Theorem 6.1
Theorem 6.2
Example 1
Theorem 7.1
Theorem 7.2
Theorem 7.3
Theorem 7.4
Theorem 8.1
...and 17 more
