Average-reward reinforcement learning in semi-Markov decision processes via relative value iteration
Huizhen Yu, Yi Wan, Richard S. Sutton
TL;DR
The paper advances average-reward reinforcement learning for semi-Markov decision processes (SMDPs) by analyzing a generalized, asynchronous RVI Q-learning algorithm within the Borkar-Meyn stochastic approximation framework. It introduces a broad class of monotonicity conditions (SISTr) on the function used to estimate the optimal reward rate, and proves almost sure convergence to a compact, connected subset of solutions to the average-reward optimality equation. Under additional stepsize and asynchrony conditions, a shadowing argument yields convergence to a unique, sample path-dependent solution. The results extend prior RVI Q-learning analyses to finite-space, weakly communicating SMDPs, relax key assumptions, and provide implementable conditions on stepsizes and asynchrony. They are relevant to model-free RL in continuous-time or variable-holding-time environments and may support distributed or hierarchical extensions under asynchronous data streams.
Abstract
This paper applies the authors' recent results on asynchronous stochastic approximation (SA) in the Borkar-Meyn framework to reinforcement learning in average-reward semi-Markov decision processes (SMDPs). We establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. In particular, we show that the algorithm converges almost surely to a compact, connected subset of solutions to the average-reward optimality equation, with convergence to a unique, sample path-dependent solution under additional stepsize and asynchrony conditions. Moreover, to make full use of the SA framework, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework and are addressed through novel arguments in the stability and convergence analysis of RVI Q-learning.
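To make the algorithm in the abstract concrete, the following is a minimal sketch of a single asynchronous RVI Q-learning update for an SMDP. It is an illustration under stated assumptions, not the paper's exact formulation: the function name `rvi_q_update`, its parameters, and the table layout are hypothetical, and the default reward-rate estimator `f(Q) = Q[0][0]` (the value of a fixed reference state-action pair) is one classical choice assumed here to meet the monotonicity conditions. The update form assumes the standard SMDP optimality equation, in which the reward-rate estimate is scaled by the holding time of the transition.

```python
def rvi_q_update(Q, s, a, reward, holding_time, s_next, stepsize, f=None):
    """Return a copy of Q with one RVI Q-learning update applied at (s, a).

    Q is a list of lists indexed as Q[state][action]. All names here are
    hypothetical and chosen for illustration only.
    """
    if f is None:
        # Assumed estimator of the optimal reward rate: the value of a
        # fixed reference pair (state 0, action 0).
        f = lambda q: q[0][0]

    rate_estimate = f(Q)
    # Temporal-difference error: the reward-rate estimate is multiplied by
    # the (random) holding time of the observed transition.
    td_error = (
        reward
        - rate_estimate * holding_time
        + max(Q[s_next])
        - Q[s][a]
    )

    # Asynchronous update: only the visited (s, a) component changes.
    Q_new = [row[:] for row in Q]
    Q_new[s][a] += stepsize * td_error
    return Q_new


# Usage: one update from an all-zero table with two states and two actions.
Q0 = [[0.0, 0.0], [0.0, 0.0]]
Q1 = rvi_q_update(Q0, s=0, a=1, reward=2.0, holding_time=0.5,
                  s_next=1, stepsize=0.1)
```

In an asynchronous setting, each state-action pair would typically keep its own update counter and use a stepsize that depends on that counter; the sketch leaves the stepsize as an explicit argument so that choice stays with the caller.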
