A Reduction Algorithm for Markovian Contextual Linear Bandits

Kaan Buyukkalayci; Osama Hanna; Christina Fragouli

Abstract

Recent work shows that when contexts are drawn i.i.d., linear contextual bandits can be reduced to single-context linear bandits. This ``contexts are cheap'' perspective is highly advantageous, as it allows for sharper finite-time analyses and leverages mature techniques from the linear bandit literature, such as those for misspecification and adversarial corruption. Motivated by applications with temporally correlated availability, we extend this perspective to Markovian contextual linear bandits, where the action set evolves via an exogenous Markov chain. Our main contribution is a reduction that applies under uniform geometric ergodicity. We construct a stationary surrogate action set to solve the problem using a standard linear bandit oracle, employing a delayed-update scheme to control the bias induced by the nonstationary conditional context distributions. We further provide a phased algorithm for unknown transition distributions that learns the surrogate mapping online. In both settings, we obtain a high-probability worst-case regret bound matching that of the underlying linear bandit oracle, with only lower-order dependence on the mixing time.
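To make the ergodicity assumption concrete, the following is a minimal sketch, not taken from the paper, of what uniform geometric ergodicity looks like on a toy chain. It assumes the standard textbook form of the condition, $\sup_s \mathrm{TV}(P^t(s,\cdot),\pi) \le C\rho^t$; the paper's Definition 3.1 may be stated differently. The two-state transition matrix `P` and the helper names `step`, `tv`, and `worst_case_tv` are hypothetical and chosen purely for illustration.

```python
# Illustrative sketch only: a hypothetical two-state context chain,
# not the paper's setting or algorithm.

def step(dist, P):
    """Advance a distribution one step: row vector `dist` times matrix `P`."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def tv(p, q):
    """Total variation distance between two finite distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def worst_case_tv(P, pi, horizon):
    """For t = 0..horizon, return max over start states of TV(P^t(s, .), pi)."""
    n = len(P)
    # One point-mass distribution per start state.
    dists = [[1.0 if j == s else 0.0 for j in range(n)] for s in range(n)]
    gaps = []
    for _ in range(horizon + 1):
        gaps.append(max(tv(d, pi) for d in dists))
        dists = [step(d, P) for d in dists]
    return gaps


# Assumed toy chain: leave state 0 w.p. 0.1, leave state 1 w.p. 0.2.
P = [[0.9, 0.1], [0.2, 0.8]]
pi = [2.0 / 3.0, 1.0 / 3.0]   # stationary distribution of P
rho = 0.7                     # second eigenvalue of P: 0.9 + 0.8 - 1
C = 2.0 / 3.0                 # worst-case distance at t = 0

gaps = worst_case_tv(P, pi, horizon=10)
for t, gap in enumerate(gaps):
    print(f"t={t:2d}  sup_s TV = {gap:.6f}  C*rho^t = {C * rho ** t:.6f}")
```

For this particular chain the bound holds with equality, so the worst-case distance to stationarity shrinks by a factor of 0.7 per step. Geometric decay of this kind is what lets a scheme that waits a number of steps comparable to the mixing time treat the conditional context distribution as approximately stationary.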

Paper Structure (25 sections, 7 theorems, 85 equations, 1 figure, 2 tables, 2 algorithms)

Introduction
Related Work
Linear and contextual linear bandits.
Asymptotically optimal exploration in contextual linear bandits.
Stochastic contextual linear bandits with random action sets and reduction.
Adversarial losses, stochastic availability, and reduction-based approaches.
Bandits and reinforcement learning with Markov structure.
Latent or partially observed context dynamics and non-stationarity.
Setup and Notation
Summary of Regret Guarantees
Known Stationary Distribution
Unknown Stationary Distribution
Numerical Results
Conclusion
Proof of Lemma \ref{lem:TV}
...and 10 more sections

Key Result

Lemma 5.1

For any two probability measures $\rho,\rho'$ on the measurable space of contexts $(\mathsf{S},\mathcal{F}_\mathsf{S})$ and any $\theta \in \Theta$, it holds that

Figures (1)

Figure 1: Mean cumulative regret over 20 runs; shaded regions indicate $\pm$ one standard error across runs around the mean.

Theorems & Definitions (20)

Definition 3.1: Uniform Geometric Ergodicity
proof
Lemma 5.1
Theorem 5.2
proof : Proof Sketch
Corollary 5.3
proof : Proof Sketch
Theorem 6.1
Lemma 6.2
proof : Proof Sketch
...and 10 more
