
Invariance-Based Dynamic Regret Minimization

Margherita Lazzaretto, Jonas Peters, Niklas Pfister

TL;DR

This paper introduces ISD-linUCB, an algorithm that uses past data to learn invariances in the reward model and then exploits them to improve online performance. It shows, both theoretically and empirically, that leveraging invariance reduces the problem dimensionality.

Abstract

We consider stochastic non-stationary linear bandits where the linear parameter connecting contexts to the reward changes over time. Existing algorithms in this setting localize the policy by gradually discarding or down-weighting past data, effectively shrinking the time horizon over which learning can occur. However, in many settings historical data may still carry partial information about the reward model. We propose to leverage such data while adapting to changes, by assuming the reward model decomposes into stationary and non-stationary components. Based on this assumption, we introduce ISD-linUCB, an algorithm that uses past data to learn invariances in the reward model and subsequently exploits them to improve online performance. We show both theoretically and empirically that leveraging invariance reduces the problem dimensionality, yielding significant regret improvements in fast-changing environments when sufficient historical data is available.
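
The decomposition idea in the abstract can be illustrated with a minimal sketch. It is not the authors' ISD-linUCB implementation: it assumes the invariant and residual subspaces are already known as orthonormal bases (`U_inv`, `U_res`, both hypothetical names), that features projected onto the two subspaces are uncorrelated, and it runs plain ridge-regression LinUCB in the residual subspace without the forgetting mechanism a non-stationary method would need. The names `fit_invariant`, `ResidualLinUCB`, `lam`, and `alpha` are illustrative choices.

```python
import numpy as np


def fit_invariant(Phi_hist, r_hist, U_inv):
    """Estimate the invariant parameter from historical data (assumed setup).

    Regresses rewards on features projected onto the invariant subspace and
    maps the coefficients back to R^p.
    """
    Z = Phi_hist @ U_inv                      # (T0, p_inv) projected features
    coef, *_ = np.linalg.lstsq(Z, r_hist, rcond=None)
    return U_inv @ coef                       # invariant component in R^p


class ResidualLinUCB:
    """LinUCB restricted to the residual subspace (illustrative only)."""

    def __init__(self, U_res, beta_inv, lam=1.0, alpha=1.0):
        self.U_res = U_res                    # (p, p_res) orthonormal basis
        self.beta_inv = beta_inv              # fixed invariant estimate
        k = U_res.shape[1]
        self.V = lam * np.eye(k)              # regularized Gram matrix
        self.b = np.zeros(k)
        self.alpha = alpha                    # exploration weight

    def select(self, feats):
        """Pick the action with the largest upper confidence bound.

        feats has shape (n_actions, p): one feature vector per action.
        """
        Z = feats @ self.U_res
        V_inv = np.linalg.inv(self.V)
        theta = V_inv @ self.b                # residual ridge estimate
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", Z, V_inv, Z))
        ucb = feats @ self.beta_inv + Z @ theta + self.alpha * bonus
        return int(np.argmax(ucb))

    def update(self, feat, reward):
        """Update only the residual statistics, using the residual reward."""
        z = feat @ self.U_res
        self.V += np.outer(z, z)
        self.b += z * (reward - feat @ self.beta_inv)
```

Under these assumptions the online learner estimates only `p_res` coefficients instead of `p`, which is the dimensionality reduction the abstract refers to; the confidence bonus is likewise computed in the smaller space.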

Paper Structure

This paper contains 29 sections, 8 theorems, 87 equations, 7 figures, 1 algorithm.

Key Result

Lemma 1

Let $\{F_t\}_{t=0}^\infty$ be a filtration. Let $\{\epsilon_t\}_{t=1}^\infty$ be a real-valued stochastic process such that $\epsilon_t$ is $F_t$-measurable and sub-Gaussian conditionally on $F_{t-1}$ with parameter $\sigma>0$. Let $\{\varphi(X_t, a_t)\}_{t=1}^\infty$ be an $\mathbb{R}^p$-valued stochastic process [...] Moreover, if, for all $t\in[T]$, $\|\varphi(X_t, a_t)\|_2\le L$, then [...]

Figures (7)

  • Figure 1: ISD-linUCB exploits historical data to improve reward predictions used by the UCB policy.
  • Figure 2: Regret of ISD-linUCB with oracle $(\mathcal{S}^{\operatorname{inv}}, \mathcal{S}^{\operatorname{res}})$ over $T=100$ rounds for $p^{\operatorname{res}}\in\{2,4,6,8\}$. For each $p^{\operatorname{res}}$ the experiment is repeated $20$ times. The left plot shows the average performance and the standard deviation over the $20$ repetitions; the right plot shows the distribution of the regret over the $20$ repetitions.
  • Figure 3: Cumulative regret of standard LinUCB and ISD-linUCB with oracle $(\mathcal{S}^{\operatorname{inv}}, \mathcal{S}^{\operatorname{res}})$ for $T=100$ and increasing values of the context-action feature dimension $p$. For ISD-linUCB, the invariant component $\beta^{\operatorname{inv}}$ is estimated using $T_0=2000$ observations. $p^{\operatorname{inv}}$ varies from $3$ to $10$, while $p^{\operatorname{res}}$ is fixed at $2$. For each $p$ the experiment is repeated $20$ times.
  • Figure 4: Cumulative regret of ISD-linUCB for $T_0\in\{1000, 3500, 8000\}$ (20 repetitions), compared with the same algorithm given oracle information and with the standard LinUCB algorithm (unaffected by $T_0$). As $T_0$ increases, the regret of ISD-linUCB approaches that of the oracle version.
  • Figure 5: Projection error for $T_0\in\{1000, 3500, 8000\}$. The plot on the left shows the same values multiplied by $\sqrt{T_0}$, confirming our assumption.
  • ...and 2 more figures

Theorems & Definitions (16)

  • Lemma 1: abbasi2011improved, Theorem 1
  • Theorem 1
  • Theorem 2
  • proof : Proof sketch of Theorem \ref{thm:estimated_beta_inv}
  • Lemma 2
  • proof : Proof sketch
  • Lemma 3
  • proof : Proof sketch
  • Theorem 3
  • proof : Proof of Theorem \ref{thm:oracle_regret}
  • ...and 6 more