
Parameter-Free Dynamic Regret for Unconstrained Linear Bandits

Alberto Rumi, Andrew Jacobsen, Nicolò Cesa-Bianchi, Fabio Vitale

Abstract

We study dynamic regret minimization in unconstrained adversarial linear bandit problems. In this setting, a learner must minimize the cumulative loss relative to an arbitrary sequence of comparators $\boldsymbol{u}_1,\ldots,\boldsymbol{u}_T$ in $\mathbb{R}^d$, but receives only point-evaluation feedback on each round. We provide a simple approach to combining the guarantees of several bandit algorithms, allowing us to optimally adapt to the number of switches $S_T = \sum_t\mathbb{I}\{\boldsymbol{u}_t \neq \boldsymbol{u}_{t-1}\}$ of an arbitrary comparator sequence. In particular, we provide the first algorithm for linear bandits achieving the optimal regret guarantee of order $\mathcal{O}\big(\sqrt{d(1+S_T) T}\big)$ up to poly-logarithmic terms without prior knowledge of $S_T$, thus resolving a long-standing open problem.
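The switch count $S_T$ from the abstract is a simple quantity to compute. The sketch below is an illustrative helper (not from the paper) that counts the rounds on which the comparator sequence changes; here comparators are represented as tuples and the sum runs over consecutive pairs.

```python
def num_switches(comparators):
    """Count rounds t with u_t != u_{t-1}, i.e. S_T for the given sequence."""
    return sum(1 for prev, cur in zip(comparators, comparators[1:]) if prev != cur)

# A fixed comparator has zero switches; each change adds one.
print(num_switches([(1, 0), (1, 0), (0, 1), (0, 1), (1, 1)]))  # 2 switches
```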


Paper Structure

This paper contains 8 sections, 3 theorems, 18 equations, 1 figure, 2 algorithms.

Key Result

Proposition 2.0

Let $\mathcal{A}_{1},\ldots,\mathcal{A}_{N}$ be online learning algorithms and let $\boldsymbol{w}_{t}^{(i)}$ denote the output of $\mathcal{A}_{i}$ on round $t$. Suppose that for all $i$, $\mathcal{A}_{i}$ guarantees $R_{T}^{\mathcal{A}_{i}}(\boldsymbol{0})=\sum_{t=1}^{T}\big[f_{t}(\boldsymbol{w}_{t}^{(i)})-f_{t}(\boldsymbol{0})\big]\le \epsilon_{i}$. Then, for linear losses $f_t(\boldsymbol{w})=\langle \boldsymbol{g}_t,\boldsymbol{w}\rangle$, the combined iterates $\boldsymbol{w}_{t}=\frac{1}{N}\sum_{i=1}^{N}\boldsymbol{w}_{t}^{(i)}$ guarantee, for every $n$ and every comparator sequence $\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{T}$,
$$R_{T}(\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{T})\le \frac{1}{N}\Big[\sum_{i\neq n}\epsilon_{i}+R_{T}^{\mathcal{A}_{n}}(N\boldsymbol{u}_{1},\ldots,N\boldsymbol{u}_{T})\Big],$$
where $R_T^{\mathcal{A}_n}(N\boldsymbol{u}_1,\ldots,N\boldsymbol{u}_T)$ denotes the dynamic regret of $\mathcal{A}_n$ against the scaled comparator sequence $N\boldsymbol{u}_1,\ldots,N\boldsymbol{u}_T$.
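For linear losses, the decomposition behind this combiner is an exact identity before any bounds are applied: the dynamic regret of the averaged iterate splits into the regret-at-origin of the other algorithms plus the regret of $\mathcal{A}_n$ against the scaled comparators. The sketch below checks this numerically, using online gradient descent learners with different step sizes as stand-in base algorithms (the step sizes and random losses are illustrative assumptions, not the paper's construction).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, N = 200, 3, 4

# Stand-in base learners: OGD with different step sizes (illustrative only).
etas = [0.01, 0.05, 0.1, 0.5]
W = [np.zeros(d) for _ in range(N)]   # current iterate of each A_i
iterates = np.zeros((N, T, d))

G = rng.normal(size=(T, d))           # linear loss vectors g_1, ..., g_T

for t in range(T):
    for i in range(N):
        iterates[i, t] = W[i]
        W[i] = W[i] - etas[i] * G[t]  # OGD update on f_t(w) = <g_t, w>

combined = iterates.mean(axis=0)      # w_t = (1/N) * sum_i w_t^{(i)}
u = rng.normal(size=(T, d))           # arbitrary comparator sequence

def dyn_regret(plays, comps):
    """Dynamic regret of the plays against a comparator sequence."""
    return sum(G[t] @ (plays[t] - comps[t]) for t in range(T))

# Identity: R_T(u) = (1/N) [ sum_{i != n} R_T^{A_i}(0) + R_T^{A_n}(N u) ]
lhs = dyn_regret(combined, u)
n = 2
rhs = (sum(dyn_regret(iterates[i], np.zeros((T, d))) for i in range(N) if i != n)
       + dyn_regret(iterates[n], N * u)) / N
assert abs(lhs - rhs) < 1e-8  # exact for linear losses, up to float error
```

Because the identity holds for every choice of $n$, bounding the $i\neq n$ terms by $\epsilon_i$ and picking the best $n$ yields the proposition's bound.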

Figures (1)

  • Figure 1: Illustration of how the Uniform Sampling interface interacts with each base algorithm $\mathcal{A}_i$. Each base algorithm internally applies the direction and scale decomposition, using its own hyperparameters.

Theorems & Definitions (6)

  • Proposition 2.0
  • Proof
  • Proposition 3.0
  • Proof
  • Theorem 3.1
  • Proof