State-Separated SARSA: A Practical Sequential Decision-Making Algorithm with Recovering Rewards

Yuto Tanimoto, Kenji Fukumizu

TL;DR

This work addresses recovering bandits, where each arm's reward depends on the time since that arm was last pulled, by casting the problem as an MDP and introducing State-Separated SARSA (SS-SARSA). SS-SARSA uses State-Separated Q-functions, each of which depends only on the states of a pair of arms, reducing the state space from $s_{\max}^K$ to $s_{\max}^2K^2$ and enabling linear-time updates. The algorithm uses a dedicated Uniform-Explore-First policy for balanced exploration, and the authors prove asymptotic convergence to the Bellman optimum under mild assumptions. Extensive simulations show robust performance across monotone and non-monotone reward structures and heterogeneous arms. The approach scales to practical recovering-bandit problems and provides a concrete framework for on-policy RL in settings with history-dependent rewards, with potential extensions to function approximation for very large problems.
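To make the two state-space sizes quoted above concrete, the following minimal sketch simply evaluates $s_{\max}^K$ and $s_{\max}^2K^2$ for a few settings. The function names and the value $s_{\max} = 5$ are assumptions chosen for illustration; they are not taken from the paper.

```python
def joint_state_count(s_max: int, K: int) -> int:
    """Number of joint states when each of K arms has a state in 1..s_max."""
    return s_max ** K


def pairwise_count(s_max: int, K: int) -> int:
    """The s_max^2 * K^2 figure quoted in the summary above."""
    return s_max ** 2 * K ** 2


if __name__ == "__main__":
    # s_max = 5 is a hypothetical value used only for this illustration.
    for s_max, K in [(5, 6), (5, 10)]:
        print(
            f"s_max={s_max}, K={K}: "
            f"joint={joint_state_count(s_max, K)}, "
            f"pairwise={pairwise_count(s_max, K)}"
        )
```

The joint count grows exponentially in $K$ while the pairwise count grows quadratically, so the gap widens quickly as arms are added; for very small $K$ the polynomial count need not be the smaller of the two.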

Abstract

While many multi-armed bandit algorithms assume that rewards for all arms are constant across rounds, this assumption does not hold in many real-world scenarios. This paper considers the setting of recovering bandits (Pike-Burke & Grunewalder, 2019), where the reward depends on the number of rounds elapsed since the last time an arm was pulled. We propose a new reinforcement learning (RL) algorithm tailored to this setting, named the State-Separated SARSA (SS-SARSA) algorithm, which treats rounds as states. The SS-SARSA algorithm achieves efficient learning by reducing the number of state combinations required for Q-learning/SARSA, which often suffer from combinatorial issues in large-scale RL problems. Additionally, it makes minimal assumptions about the reward structure and offers lower computational complexity. Furthermore, we prove asymptotic convergence to an optimal policy under mild assumptions. Simulation studies demonstrate the superior performance of our algorithm across various settings.
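As a rough illustration of "treating rounds as states", the sketch below tracks, for each arm, the number of rounds since it was last pulled, and applies a standard tabular SARSA update on the full joint state. It is not the SS-SARSA algorithm itself. The transition convention (the pulled arm resets to 1, the others age by one round and are capped at `s_max`) and the names `step` and `sarsa_update` are assumptions made for this example.

```python
from typing import Dict, Tuple

State = Tuple[int, ...]


def step(state: State, arm: int, s_max: int) -> State:
    """Assumed transition: the pulled arm resets to 1, the others age by one, capped at s_max."""
    return tuple(1 if i == arm else min(s + 1, s_max) for i, s in enumerate(state))


def sarsa_update(
    Q: Dict[Tuple[State, int], float],
    s: State, a: int, r: float,
    s_next: State, a_next: int,
    alpha: float, gamma: float,
) -> float:
    """Standard tabular SARSA update keyed on the full joint state (not SS-SARSA)."""
    old = Q.get((s, a), 0.0)
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = old + alpha * (target - old)
    return Q[(s, a)]


if __name__ == "__main__":
    s_max = 3
    state = (3, 3, 3)             # assumed start: every arm fully "recovered"
    nxt = step(state, 0, s_max)   # pull arm 0
    print(state, "->", nxt)
    Q: Dict[Tuple[State, int], float] = {}
    # r = 1.0 is a made-up reward used only to exercise the update.
    print(sarsa_update(Q, state, 0, 1.0, nxt, 1, alpha=0.5, gamma=0.9))
```

Because the table `Q` is keyed on the whole tuple of arm states, its size grows with the number of joint states, which is the combinatorial issue the abstract refers to.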

Paper Structure

This paper contains 16 sections, 1 theorem, 18 equations, 10 figures, 1 table, 2 algorithms.

Key Result

Theorem 5.1

(Convergence of $Q$-functions) Suppose that the variance of the (stochastic) reward $r$ is finite and $\alpha_t = \frac{1}{t + t_0}$, where $t_0$ is some constant. This learning rate satisfies the Robbins-Monro conditions (i.e. $\sum_{t = 1}^\infty \alpha_t = \infty$ and $\sum_{t = 1}^\infty \alpha_t^2 < \infty$). ...
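The Robbins-Monro conditions on the learning rate $\alpha_t = \frac{1}{t + t_0}$ (the sum of $\alpha_t$ diverges while the sum of $\alpha_t^2$ stays finite) can be checked numerically. The sketch below is only a sanity check on partial sums, not part of the proof; the function name `partial_sums` and the choice $t_0 = 1$ are assumptions for illustration.

```python
def partial_sums(t0: float, n: int):
    """Return (sum of alpha_t, sum of alpha_t^2) for t = 1..n with alpha_t = 1 / (t + t0)."""
    s1 = 0.0
    s2 = 0.0
    for t in range(1, n + 1):
        alpha = 1.0 / (t + t0)
        s1 += alpha
        s2 += alpha * alpha
    return s1, s2


if __name__ == "__main__":
    t0 = 1.0  # hypothetical constant, chosen only for this check
    for n in (10**3, 10**4, 10**5):
        s1, s2 = partial_sums(t0, n)
        # s1 keeps growing (roughly like log n); s2 levels off below 1 / t0.
        print(f"n={n}: sum alpha={s1:.4f}, sum alpha^2={s2:.6f}")
```

The first sum behaves like a shifted harmonic series and grows without bound, while the second is bounded above by $\int_0^\infty (x + t_0)^{-2}\,dx = 1/t_0$ for any $t_0 > 0$.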

Figures (10)

  • Figure 1: Small-scale problem ($K = 3, s_{\max} = 3, \gamma = 0.99999$)
  • Figure 2: Increasing rewards ($K = 6, \gamma = 0.99999$)
  • Figure 3: Increasing rewards ($K = 10, \gamma = 0.999999$)
  • Figure 4: Increasing-then-decreasing rewards ($K = 6, \gamma = 0.99999$)
  • Figure 5: Increasing-then-decreasing rewards ($K = 10, \gamma = 0.999999$)
  • ...and 5 more figures

Theorems & Definitions (4)

  • Remark 4.1
  • Theorem 5.1
  • Remark 5.2
  • Remark 5.3