State-Separated SARSA: A Practical Sequential Decision-Making Algorithm with Recovering Rewards
Yuto Tanimoto, Kenji Fukumizu
TL;DR
This work addresses recovering bandits, where each arm's reward depends on the number of rounds since that arm was last pulled, by casting the problem as an MDP and introducing State-Separated SARSA (SS-SARSA). SS-SARSA uses State-Separated Q-functions, each of which depends only on the states of a pair of arms. For $K$ arms and a maximum state $s_{\max}$, this reduces the number of state combinations from $s_{\max}^K$ to $s_{\max}^2K^2$ and enables linear-time updates. The algorithm is paired with a dedicated Uniform-Explore-First policy for balanced exploration, and the authors prove asymptotic convergence to the Bellman optimum under mild assumptions. Extensive simulations show robust performance across monotone and non-monotone reward structures and heterogeneous arms. The approach scales to practical recovering-bandit problems and provides a concrete framework for on-policy RL in settings with history-dependent rewards, with potential extensions to function approximation for very large problems.
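To make the size of that reduction concrete, the following minimal sketch compares the two counts quoted above. The function names are hypothetical, and it assumes $K$ is the number of arms and $s_{\max}$ is the largest state an arm can take; it only evaluates the two expressions and is not the authors' code.

```python
def joint_state_count(s_max: int, num_arms: int) -> int:
    """State combinations when all arm states are tracked jointly: s_max^K."""
    return s_max ** num_arms


def state_separated_count(s_max: int, num_arms: int) -> int:
    """State combinations quoted for pairwise Q-functions: s_max^2 * K^2."""
    return (s_max ** 2) * (num_arms ** 2)


if __name__ == "__main__":
    # Hypothetical problem size chosen only for illustration.
    s_max, num_arms = 10, 6
    print("joint:", joint_state_count(s_max, num_arms))
    print("state-separated:", state_separated_count(s_max, num_arms))
```

The joint count grows exponentially in the number of arms, while the state-separated count grows only quadratically, which is why the gap widens quickly as arms are added.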
Abstract
While many multi-armed bandit algorithms assume that rewards for all arms are constant across rounds, this assumption does not hold in many real-world scenarios. This paper considers the setting of recovering bandits (Pike-Burke & Grunewalder, 2019), where the reward of an arm depends on the number of rounds elapsed since the last time it was pulled. We propose a new reinforcement learning (RL) algorithm tailored to this setting, named the State-Separated SARSA (SS-SARSA) algorithm, which treats the number of elapsed rounds as the state. The SS-SARSA algorithm achieves efficient learning by reducing the number of state combinations required by Q-learning and SARSA, which often suffer from combinatorial explosion in large-scale RL problems. Additionally, it makes minimal assumptions about the reward structure and offers lower computational complexity. Furthermore, we prove asymptotic convergence to an optimal policy under mild assumptions. Simulation studies demonstrate the superior performance of our algorithm across various settings.
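The recovering-bandit setting described above can be sketched as a small environment in which each arm's state is the number of rounds since its last pull. This is an illustrative sketch under stated assumptions, not the paper's implementation: the class name `RecoveringBanditEnv`, the cap of states at `s_max`, the choice to start every arm at `s_max`, and the deterministic reward functions are all assumptions made for the example.

```python
from typing import Callable, List


class RecoveringBanditEnv:
    """Toy recovering bandit: reward of an arm depends on rounds since its last pull."""

    def __init__(self, reward_fns: List[Callable[[int], float]], s_max: int) -> None:
        self.reward_fns = reward_fns  # one reward function per arm (hypothetical)
        self.s_max = s_max            # assumption: elapsed rounds are capped at s_max
        # Assumption: every arm starts fully recovered.
        self.states = [s_max] * len(reward_fns)

    def step(self, arm: int) -> float:
        """Pull `arm`, return its reward, and advance every arm's state."""
        reward = self.reward_fns[arm](self.states[arm])
        # The pulled arm resets to 1; all other arms age by one round, up to s_max.
        self.states = [
            1 if k == arm else min(s + 1, self.s_max)
            for k, s in enumerate(self.states)
        ]
        return reward


if __name__ == "__main__":
    # Two arms whose (noise-free) reward grows with the time since the last pull.
    env = RecoveringBanditEnv([lambda s: s, lambda s: 2 * s], s_max=3)
    for arm in [0, 0, 1, 1]:
        print(arm, env.step(arm), env.states)
```

Pulling the same arm twice in a row yields a lower second reward here because its state has been reset, which is the history dependence that a standard stationary bandit model ignores.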
