Adaptive Exploration for Latent-State Bandits

Jikai Jin, Kenneth Hung, Sanath Kumar Krishnamurthy, Baoyi Shi, Congshan Zhang

TL;DR

This work addresses key challenges arising from unobserved confounders, such as biased reward estimates and limited state information, by introducing a family of state-model-free bandit algorithms that leverage lagged contextual features and coordinated probing strategies.

Abstract

The multi-armed bandit problem is a core framework for sequential decision-making under uncertainty, but classical algorithms often fail in environments with hidden, time-varying states that confound reward estimation and optimal action selection. We address key challenges arising from unobserved confounders, such as biased reward estimates and limited state information, by introducing a family of state-model-free bandit algorithms that leverage lagged contextual features and coordinated probing strategies. These implicitly track latent states and disambiguate state-dependent reward patterns. Our methods and their adaptive variants can learn optimal policies without explicit state modeling, combining computational efficiency with robust adaptation to non-stationary rewards. Empirical results across diverse settings demonstrate superior performance over classical approaches, and we provide practical recommendations for algorithm selection in real-world applications.

Paper Structure

This paper contains 27 sections, 1 theorem, 15 equations, 3 figures, 5 tables, 5 algorithms.

Key Result

Theorem 1

Suppose $S=2$ and each state has a unique optimal arm. Suppose further that the Markov chain satisfies $\mathbb{P}(s_{t+1}\neq s_t)\le q$ for all $t$. Consider an idealized periodic probing policy that probes once every $\tau$ rounds. Each probe incurs regret at most $\Delta_{\mathrm{probe}}$ and re Ignoring $\varepsilon_{\mathrm{fp}}$, the bound is minimized at $\tau^* \asymp \sqrt{\Delta_{\mathr
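
The trade-off behind the probing period $\tau$ in Theorem 1 can be explored numerically. The following is a minimal sketch under simplifying assumptions of our own, not the paper's algorithm or its bound: two hidden states with a symmetric per-round switching probability `q`, a probe that reveals the current state exactly at a cost of `delta_probe`, and a fixed per-round regret `gap` whenever the policy plays the arm that is optimal for the wrong state. The function name and all parameter names are hypothetical.

```python
import random


def simulate_periodic_probing(tau, q, delta_probe, gap, horizon, seed=0):
    """Average per-round regret of a toy periodic-probing policy.

    Illustrative assumptions (not taken from the paper):
      * two hidden states that flip with probability q each round,
      * a probe every tau rounds reveals the current state exactly
        and costs delta_probe regret,
      * between probes the policy plays the arm that is optimal for
        the last probed state, paying gap whenever that state is stale.
    """
    rng = random.Random(seed)
    state = 0      # true hidden state
    believed = 0   # state observed at the most recent probe
    total_regret = 0.0
    for t in range(horizon):
        # The hidden state evolves as a two-state Markov chain.
        if t > 0 and rng.random() < q:
            state = 1 - state
        if t % tau == 0:
            # Probe round: pay the probing cost and refresh the belief.
            believed = state
            total_regret += delta_probe
        elif believed != state:
            # Exploit round with a stale belief: pay the suboptimality gap.
            total_regret += gap
    return total_regret / horizon


if __name__ == "__main__":
    # Sweep the probing period to see the cost of probing too often
    # versus the cost of acting on stale state information.
    for tau in (1, 2, 5, 10, 20, 50, 100):
        avg = simulate_periodic_probing(
            tau, q=0.02, delta_probe=0.5, gap=1.0, horizon=100_000
        )
        print(f"tau={tau:4d}  average regret per round={avg:.4f}")
```

In this toy model, small values of `tau` pay the probing cost often, while large values leave the policy acting on outdated state information for longer, so sweeping `tau` shows how the two costs pull in opposite directions.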

Figures (3)

  • Figure 1: A directed acyclic graph (DAG) representing the causal relationship among hidden state $s_t$, action $a_t$ and reward $r_t$. The solid arrows represent the latent-state bandit setting we have, while the dashed arrows represent the feedback from past actions and rewards due to the bandit algorithm.
  • Figure 2: Head-to-head winning rates under different problem-specific parameters.
  • Figure 3: Frequency of pulling the optimal arm across time.

Theorems & Definitions (3)

  • Example 1
  • Theorem 1
  • Proof sketch