On Convergence of Average-Reward Q-Learning in Weakly Communicating Markov Decision Processes
Yi Wan, Huizhen Yu, Richard S. Sutton
TL;DR
The paper advances the theoretical foundations of average-reward reinforcement learning by proving almost-sure convergence of Q-learning based on relative value iteration (RVI) for weakly communicating MDPs, including the case where the solutions to the average-reward optimality equation have multiple degrees of freedom. It introduces a unified stochastic approximation framework for asynchronous RVI updates, characterizes the convergence sets as compact, connected (potentially nonconvex) sets with one fewer degree of freedom than the full solution set, and demonstrates convergence for two hierarchical option-based algorithms in weakly communicating SMDPs. The results broaden applicability beyond unichain assumptions and provide deeper insight into the geometry of solutions under the average-reward criterion, with empirical validation and discussion of extensions to off-policy and intra-option settings. This work has practical implications for scalable, long-horizon RL in complex state spaces and hierarchical control tasks where average performance is crucial.
Abstract
This paper analyzes reinforcement learning (RL) algorithms for Markov decision processes (MDPs) under the average-reward criterion. We focus on Q-learning algorithms based on relative value iteration (RVI), which are model-free stochastic analogues of the classical RVI method for average-reward MDPs. These algorithms have low per-iteration complexity, making them well suited to problems with large state spaces. We extend the almost-sure convergence analysis of RVI Q-learning algorithms developed by Abounadi, Bertsekas, and Borkar (2001) from unichain to weakly communicating MDPs. This extension is important both practically and theoretically: weakly communicating MDPs cover a much broader range of applications than unichain MDPs, and their optimality equations have a richer solution structure (with multiple degrees of freedom), which introduces additional complexity in proving algorithmic convergence. We also characterize the sets to which RVI Q-learning algorithms converge, showing that they are compact, connected, potentially nonconvex, and composed of solutions to the average-reward optimality equation, with exactly one fewer degree of freedom than the general solution set of this equation. Furthermore, we extend our analysis to two RVI-based hierarchical average-reward RL algorithms using the options framework, proving their almost-sure convergence and characterizing their sets of convergence under the assumption that the underlying semi-Markov decision process is weakly communicating.
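To make the kind of update under discussion concrete, here is a minimal tabular sketch of one asynchronous RVI Q-learning step, assuming the common form in which a reference function f(Q) stands in for the unknown optimal reward rate. The choice f(Q) = Q[ref_state][ref_action], the two-state toy MDP (`toy_step`), the uniformly random behavior policy, and the step-size schedule are all illustrative assumptions, not details taken from the paper.

```python
import random


def rvi_q_learning_step(Q, s, a, r, s_next, alpha, ref=(0, 0)):
    """Apply one tabular RVI Q-learning update in place and return the TD error.

    Assumption: f(Q) is the value of a fixed reference state-action pair.
    """
    f_Q = Q[ref[0]][ref[1]]  # stands in for the reward-rate estimate
    td_error = r - f_Q + max(Q[s_next]) - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


def toy_step(s, a):
    """Hypothetical deterministic MDP: action 0 stays, action 1 switches state.

    Reward is 1 only for staying in state 1, so the optimal reward rate is 1.
    """
    s_next = s if a == 0 else 1 - s
    r = 1.0 if (s == 1 and a == 0) else 0.0
    return s_next, r


def run_demo(steps=5000, seed=0):
    """Run the update along one trajectory of a uniformly random policy."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]
    counts = [[0, 0], [0, 0]]
    s = 0
    for _ in range(steps):
        a = rng.randrange(2)
        s_next, r = toy_step(s, a)
        counts[s][a] += 1
        # Illustrative diminishing step size, one counter per state-action pair.
        alpha = counts[s][a] ** -0.8
        rvi_q_learning_step(Q, s, a, r, s_next, alpha)
        s = s_next
    return Q


if __name__ == "__main__":
    print(run_demo())
```

In this toy MDP the table Q = [[1, 2], [3, 1]] satisfies the optimality equation with f(Q) = 1, so every update there has zero TD error. The toy problem is communicating and its solution is unique up to an additive constant, so it does not exhibit the multiple degrees of freedom that arise in general weakly communicating MDPs.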
