On Convergence of Average-Reward Q-Learning in Weakly Communicating Markov Decision Processes

Yi Wan; Huizhen Yu; Richard S. Sutton

TL;DR

The paper proves almost-sure convergence of RVI-based Q-learning in weakly communicating MDPs, including cases where the solutions to the average-reward optimality equation have multiple degrees of freedom. It introduces a unified stochastic approximation framework for asynchronous RVI updates, characterizes the convergence sets as compact, connected, and possibly nonconvex sets with one fewer degree of freedom than the full solution set, and proves convergence of two hierarchical option-based algorithms in weakly communicating SMDPs. The results extend applicability beyond the unichain assumption and clarify the geometry of solution sets under the average-reward criterion, with empirical verification and discussion of extensions to off-policy and intra-option settings. These results are relevant to long-horizon RL in large state spaces and to hierarchical control tasks where average performance matters.

Abstract

This paper analyzes reinforcement learning (RL) algorithms for Markov decision processes (MDPs) under the average-reward criterion. We focus on Q-learning algorithms based on relative value iteration (RVI), which are model-free stochastic analogues of the classical RVI method for average-reward MDPs. These algorithms have low per-iteration complexity, making them well-suited for large state space problems. We extend the almost-sure convergence analysis of RVI Q-learning algorithms developed by Abounadi, Bertsekas, and Borkar (2001) from unichain to weakly communicating MDPs. This extension is important both practically and theoretically: weakly communicating MDPs cover a much broader range of applications compared to unichain MDPs, and their optimality equations have a richer solution structure (with multiple degrees of freedom), introducing additional complexity in proving algorithmic convergence. We also characterize the sets to which RVI Q-learning algorithms converge, showing that they are compact, connected, potentially nonconvex, and comprised of solutions to the average-reward optimality equation, with exactly one less degree of freedom than the general solution set of this equation. Furthermore, we extend our analysis to two RVI-based hierarchical average-reward RL algorithms using the options framework, proving their almost-sure convergence and characterizing their sets of convergence under the assumption that the underlying semi-Markov decision process is weakly communicating.
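To make the kind of update being analyzed concrete, here is a minimal sketch of tabular RVI Q-learning. Everything specific in it is an assumption for illustration and is not taken from the paper: the two-state MDP, the choice $f(Q) = Q(0, 0)$ as the reference function, uniform random sampling of state-action pairs, and a constant step size (used only because this toy MDP is deterministic; the convergence theory concerns diminishing step sizes).

```python
import random

# Hypothetical two-state, two-action deterministic MDP (not from the paper).
# It is communicating but not unichain: the policy that picks action 0 in
# state 0 and action 1 in state 1 has two recurrent classes.
NEXT_STATE = [[0, 1], [0, 1]]      # NEXT_STATE[s][a]
REWARD = [[0.0, 1.0], [2.0, 0.0]]  # REWARD[s][a]


def rvi_q_learning(num_updates=50_000, step_size=0.05, seed=0):
    """Asynchronous RVI Q-learning sketch with f(Q) = Q[0][0] (an assumed choice)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(num_updates):
        # Update one uniformly sampled state-action pair per iteration.
        s = rng.randrange(2)
        a = rng.randrange(2)
        s_next = NEXT_STATE[s][a]
        r = REWARD[s][a]
        reward_rate = q[0][0]  # f(Q): current estimate of the optimal reward rate
        td_error = r - reward_rate + max(q[s_next]) - q[s][a]
        q[s][a] += step_size * td_error
    return q, q[0][0]


q, rate = rvi_q_learning()
greedy = [max(range(2), key=lambda a: q[s][a]) for s in range(2)]
print("estimated reward rate:", rate)
print("greedy policy:", greedy)
```

In this toy MDP the best behavior is to alternate between the two states, collecting rewards 1 and 2, so the optimal reward rate is 1.5 by construction. The optimal policy is unique here, so the solution set has a single degree of freedom; the multi-degree-of-freedom case that the paper addresses would need a different example.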

Paper Structure (28 sections, 20 theorems, 93 equations, 4 figures)

Introduction
Background
MDPs with the Average-Reward Criterion
Weakly Communicating MDPs: Optimality Equations & Solution Structures
Relative Value Iteration
Convergence of RVI Q-Learning
Algorithmic Framework
Main Results
Empirical Verification of the Convergence Theorem
Convergence of Options Algorithms
Average-Reward Weakly Communicating SMDPs
Average-Reward Learning with Options: Problem Formulations
Inter-Option Formulation
Intra-Option Formulation
Inter-Option Algorithm
...and 13 more sections

Key Result

Theorem 3.1

If the MDP is weakly communicating and the assumption on the function $f$ holds, then $\mathcal{Q}_\infty$ is nonempty, compact, connected, and possibly nonconvex.
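For orientation, the equation whose solutions make up this set can be written as below. This is a sketch in assumed notation, not a quotation from the paper: $p$, $r$, and $r_*$ (transition probabilities, expected rewards, optimal reward rate) are symbols introduced here, and the description of $\mathcal{Q}_\infty$ as the solutions pinned down by $f(q) = r_*$ is an assumed reading consistent with the abstract's "one less degree of freedom" statement.

```latex
% Average-reward optimality equation in action-value form (standard form;
% the symbols p, r, r_* are assumed notation, not taken from this page).
q(s,a) = r(s,a) - r_* + \sum_{s'} p(s' \mid s,a)\, \max_{a'} q(s',a')
\quad \text{for all } (s,a).

% Assumed reading of the convergence set: solutions of the equation above
% (the set \mathcal{Q}) on which f equals the optimal reward rate.
\mathcal{Q}_\infty = \{\, q \in \mathcal{Q} : f(q) = r_* \,\}.
```

Any solution $q$ stays a solution after adding a constant to every entry. If $f$ shifts by the same constant, as is typically assumed for RVI-style reference functions, the constraint $f(q) = r_*$ removes exactly that one degree of freedom.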

Figures (4)

Figure 1: Three examples of communicating MDPs with or without the uniqueness solution property. All these MDPs have two states $\{1, 2\}$ and two actions $\{\texttt{solid}, \texttt{dashed}\}$ with deterministic effects. The directed solid and dashed curves between states depict deterministic state transitions corresponding to actions solid and dashed, respectively, with associated rewards indicated by numbers. Subfigure (a): a unichain MDP; (b): an MDP that is not unichain but has unique solutions in $\mathcal{Q}$ (up to an additive constant); (c): an MDP without the uniqueness solution property.
Figure 2: Tested MDPs for verifying the convergence of Differential Q-learning and RVI Q-learning when the solution set has more than one degree of freedom.
Figure 3: Dynamics of the estimated values produced by Differential Q-learning and RVI Q-learning in the two MDPs shown in Figure 2. The green regions denote $\mathcal{Q}^o$.
Figure 4: An illustrative MDP example. Left: The example MDP has three states $\{1, 2, 3\}$ and two actions $\{\texttt{solid}, \texttt{dashed}\}$ with deterministic effects. The directed solid and dashed curves between states depict deterministic state transitions corresponding to actions solid and dashed, respectively. Taking action solid (resp. dashed) at state 3 (resp. state 1) results in a reward of $-1$ (resp. $-2$), while all other rewards are $0$. Right: Visualization of the solution set $\mathcal{V}$ and its subset $\mathcal{V}(\mathcal{Q}_{s})$, comprising the state value functions corresponding to the solutions in $\mathcal{Q}_{s}$. The red and blue regions together represent $\mathcal{V}$, while the two yellow line segments correspond to $\mathcal{V}(\mathcal{Q}_{s})$. Both sets are nonconvex.

Theorems & Definitions (35)

Definition 1
Example 1
Remark 3.1
Example 2: Differential Q-learning (Wan et al., 2021)
Example 3
Theorem 3.1
Theorem 3.2: convergence theorem
Remark 3.2
Corollary 3.1
Proposition 4.1: SMDP--MDP connection
...and 25 more
