Restless Bandit Problem with Rewards Generated by a Linear Gaussian Dynamical System

Jonathan Gornet, Bruno Sinopoli

TL;DR

The paper addresses a restless bandit problem in which rewards are generated by a linear Gaussian dynamical system (LGDS). It proposes predicting rewards with a learned, modified Kalman filter and selecting actions with an uncertainty-aware strategy called UBSS. A linear predictive model expresses each reward as a linear combination of past rewards, which enables cross-action reward estimation and tractable identification of $G_{c_a|\mathbf{c}}$ for all actions. The paper proves a bound on the prediction error of the modified Kalman filter and derives a high-probability regret bound for UBSS. Numerical results show performance competitive with standard stochastic multi-armed bandit (SMAB) algorithms in LGDS settings. The approach offers a principled method for control under continuous-state dynamics in non-stationary bandit environments, with potential applicability to hyperparameter tuning and other DL-based optimization tasks.

Abstract

Decision-making under uncertainty is a fundamental problem encountered frequently and can be formulated as a stochastic multi-armed bandit problem. In the problem, the learner interacts with an environment by choosing an action at each round, where a round is an instance of an interaction. In response, the environment reveals a reward, which is sampled from a stochastic process, to the learner. The goal of the learner is to maximize cumulative reward. In this work, we assume that the rewards are the inner product of an action vector and a state vector generated by a linear Gaussian dynamical system. To predict the reward for each action, we propose a method that takes a linear combination of previously observed rewards for predicting each action's next reward. We show that, regardless of the sequence of previous actions chosen, the reward sampled for any previously chosen action can be used for predicting another action's future reward, i.e. the reward sampled for action 1 at round $t-1$ can be used for predicting the reward for action $2$ at round $t$. This is accomplished by designing a modified Kalman filter with a matrix representation that can be learned for reward prediction. Numerical evaluations are carried out on a set of linear Gaussian dynamical systems and are compared with 2 other well-known stochastic multi-armed bandit algorithms.
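The reward model described in the abstract can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from the paper: the state is scalar (so each "action vector" is a single number), the system parameters `gamma`, `q`, and `r` are known, and a standard Kalman filter is used in place of the paper's learned, modified filter. All function and parameter names are hypothetical. The point is only to show how a reward observed for one action updates a shared state estimate, which then yields a prediction for every other action.

```python
import random


def simulate_rewards(gamma, c, q, r, actions, seed=0):
    """Simulate a scalar LGDS and return the reward observed at each round.

    State:  z[t+1] = gamma * z[t] + process noise (variance q)
    Reward: c[a_t] * z[t] + measurement noise (variance r)
    The state evolves regardless of which action is chosen, which is
    what makes the bandit restless.
    """
    rng = random.Random(seed)
    z = 0.0
    rewards = []
    for a in actions:
        rewards.append(c[a] * z + rng.gauss(0.0, r ** 0.5))
        z = gamma * z + rng.gauss(0.0, q ** 0.5)
    return rewards


def kalman_step(z_hat, p, reward, c_a, gamma, q, r):
    """Measurement update with the chosen action's c_a, then a time update.

    Returns the predicted state mean and variance for the next round.
    """
    k = p * c_a / (c_a * p * c_a + r)            # Kalman gain
    z_post = z_hat + k * (reward - c_a * z_hat)  # corrected state estimate
    p_post = (1.0 - k * c_a) * p                 # corrected error variance
    return gamma * z_post, gamma * p_post * gamma + q


def predict_all_actions(rewards, actions, c, gamma, q, r, z0=0.0, p0=1.0):
    """After each round, predict the next-round reward of every action."""
    z_hat, p = z0, p0  # assumed prior on the initial state
    predictions = []
    for x, a in zip(rewards, actions):
        z_hat, p = kalman_step(z_hat, p, x, c[a], gamma, q, r)
        # One observed reward informs the prediction for every action.
        predictions.append([c_b * z_hat for c_b in c])
    return predictions


if __name__ == "__main__":
    c = [1.0, -0.5]               # hypothetical action "vectors" (scalars here)
    actions = [0, 0, 1, 0, 1, 1]  # arbitrary action sequence
    rewards = simulate_rewards(0.9, c, 0.1, 0.05, actions)
    preds = predict_all_actions(rewards, actions, c, 0.9, 0.1, 0.05)
    for t, row in enumerate(preds):
        print(t, actions[t], row)
```

In this sketch the prediction for the unchosen action is available at every round, whatever the action sequence was. The paper's contribution, by contrast, is to obtain such predictions from a learned linear combination of past rewards without assuming the system parameters are known.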

Paper Structure

This paper contains 14 sections, 8 theorems, 77 equations, 1 figure, and 1 algorithm.

Key Result

Lemma 1

Let $P_a$, $a \in [k]$ be the steady state solution of the Kalman filter for each action $c_a \in \mathcal{A}$, $P_a = g\left(P_a,c_a\right)$, where $g\left(P_a,c_a\right)$ is defined in eq:g_definition. Define $P_{\overline{a}} \succeq 0$ to be the steady-state error covariance matrix of the Kalman

Figures (1)

  • Figure 1: Regret of each comparison algorithm, normalized with respect to UBSS's regret. A positive percentage implies that UBSS has lower regret than the compared algorithm.
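
The caption does not state the exact normalization used. One plausible reading, shown here purely as an assumption, is the difference between the two regrets expressed as a percentage of UBSS's regret. The function name and formula below are hypothetical and are not taken from the paper.

```python
def normalized_regret_percent(algorithm_regret, ubss_regret):
    """Percentage by which an algorithm's regret exceeds UBSS's regret.

    Assumed formula (not confirmed by the source):
        100 * (algorithm_regret - ubss_regret) / ubss_regret
    A positive value means UBSS accumulated less regret.
    """
    if ubss_regret <= 0:
        raise ValueError("ubss_regret must be positive for this normalization")
    return 100.0 * (algorithm_regret - ubss_regret) / ubss_regret


if __name__ == "__main__":
    # Hypothetical regret values, for illustration only.
    print(normalized_regret_percent(120.0, 100.0))
```

Under this reading, a value of zero means both algorithms incurred the same regret, and a negative value means the compared algorithm outperformed UBSS.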

Theorems & Definitions (17)

  • Remark 1
  • Lemma 1
  • Theorem 1
  • Remark 2
  • Theorem 2
  • proof
  • proof
  • Lemma 2
  • proof
  • Theorem 3
  • ...and 7 more