Adaptive Exploration for Latent-State Bandits

Jikai Jin; Kenneth Hung; Sanath Kumar Krishnamurthy; Baoyi Shi; Congshan Zhang

TL;DR

This work addresses key challenges arising from unobserved confounders, such as biased reward estimates and limited state information, by introducing a family of state-model-free bandit algorithms that leverage lagged contextual features and coordinated probing strategies.

Abstract

The multi-armed bandit problem is a core framework for sequential decision-making under uncertainty, but classical algorithms often fail in environments with hidden, time-varying states that confound reward estimation and optimal action selection. We address key challenges arising from unobserved confounders, such as biased reward estimates and limited state information, by introducing a family of state-model-free bandit algorithms that leverage lagged contextual features and coordinated probing strategies. These features and probes implicitly track latent states and disambiguate state-dependent reward patterns. Our methods and their adaptive variants can learn optimal policies without explicit state modeling, combining computational efficiency with robust adaptation to non-stationary rewards. Empirical results across diverse settings demonstrate superior performance over classical approaches, and we provide practical recommendations for algorithm selection in real-world applications.
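To make the idea of lagged contextual features concrete, here is a minimal sketch, not the paper's algorithm. It assumes a two-state, two-arm Bernoulli bandit with hypothetical reward means, an epsilon-greedy learner, and a tabular context made of the previous (action, reward) pair. The function name `simulate` and all parameter values are illustrative assumptions.

```python
import random

def simulate(use_lag, horizon=20000, switch_prob=0.02, eps=0.1, seed=0):
    """Epsilon-greedy on a 2-state, 2-arm Bernoulli bandit with a hidden state.

    With use_lag=True the learner keeps one value table per
    (previous action, previous reward) pair; otherwise a single table.
    Returns the average reward per round.
    """
    rng = random.Random(seed)
    # Hypothetical reward means: arm 0 is best in state 0, arm 1 in state 1.
    means = [[0.9, 0.1], [0.1, 0.9]]
    state = 0
    context = (0, 0)  # (previous action, previous reward)
    sums, counts = {}, {}
    total = 0.0
    for _ in range(horizon):
        key = context if use_lag else None
        s = sums.setdefault(key, [0.0, 0.0])
        c = counts.setdefault(key, [0, 0])
        if rng.random() < eps:
            arm = rng.randrange(2)  # uniform exploration
        else:
            # Untried arms get an optimistic estimate of 1.0.
            est = [s[a] / c[a] if c[a] else 1.0 for a in range(2)]
            arm = 0 if est[0] >= est[1] else 1
        reward = 1 if rng.random() < means[state][arm] else 0
        s[arm] += reward
        c[arm] += 1
        total += reward
        context = (arm, reward)
        # The hidden state flips with a small probability each round.
        if rng.random() < switch_prob:
            state = 1 - state
    return total / horizon

if __name__ == "__main__":
    print("single table:", simulate(use_lag=False))
    print("lagged context:", simulate(use_lag=True))
```

In this toy setting the previous action and reward carry information about the current hidden state, so the lagged learner can condition on them without ever modeling the state explicitly. The single-table learner averages rewards over both states and cannot tell them apart.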

Paper Structure (27 sections, 1 theorem, 15 equations, 3 figures, 5 tables, 5 algorithms)


Introduction
Key Challenges
Our Contributions
Related Work
Classical and Contextual Bandits
Doubly-robust approaches
Non-stationary and adversarial bandit
Restless bandit and reinforcement learning
Latent Markov decision processes
Setup and Notations
Latent-state bandit model
Dynamic regret
Notations
Lagged Action-Reward as Context
Intuition
...and 12 more sections

Key Result

Theorem 1

Suppose $S=2$ and each state has a unique optimal arm. Suppose further that the Markov chain satisfies $\mathbb{P}(s_{t+1}\neq s_t)\le q$ for all $t$. Consider an idealized periodic probing policy that probes once every $\tau$ rounds. Each probe incurs regret at most $\Delta_{\mathrm{probe}}$ and re Ignoring $\varepsilon_{\mathrm{fp}}$, the bound is minimized at $\tau^* \asymp \sqrt{\Delta_{\mathr
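The theorem statement is truncated in this listing, so the exact bound is not reproduced here. As a generic illustration of why a square-root probing period can arise, the sketch below balances an amortized probing cost against a staleness cost that is assumed to grow linearly in $\tau$. The cost model and the coefficient `c_stale` are hypothetical and are not taken from the paper.

```python
import math

def per_round_cost(tau, delta_probe, c_stale):
    """Illustrative per-round cost of probing once every tau rounds.

    delta_probe / tau : probing regret amortized over the period.
    c_stale * tau     : assumed (hypothetical) cost of acting on stale state
                        information, growing linearly in the period.
    """
    return delta_probe / tau + c_stale * tau

def best_period(delta_probe, c_stale):
    """Closed-form minimizer of per_round_cost over tau > 0."""
    return math.sqrt(delta_probe / c_stale)

if __name__ == "__main__":
    tau_star = best_period(delta_probe=0.5, c_stale=0.02)
    print(tau_star, per_round_cost(tau_star, 0.5, 0.02))
```

Under this assumed model, probing more often than the minimizer wastes regret on probes, while probing less often leaves the learner acting on outdated information for longer.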

Figures (3)

Figure 1: A directed acyclic graph (DAG) of the causal relationships among the hidden state $s_t$, action $a_t$, and reward $r_t$. Solid arrows represent the latent-state bandit setting; dashed arrows represent the feedback from past actions and rewards induced by the bandit algorithm.
Figure 2: Head-to-head winning rates under different problem-specific parameters.
Figure 3: Frequency of pulling the optimal arm across time.

Theorems & Definitions (3)

Example 1
Theorem 1
Proof sketch
