Q-Learning for Continuous State and Action MDPs under Average Cost Criteria

Ali Devran Kara; Serdar Yuksel

TL;DR

This work tackles the challenge of solving infinite-horizon average-cost MDPs with continuous state and action spaces by developing discretization-based approximations and quantized Q-learning methods. It establishes near-optimality guarantees for finite, quantized models under weaker continuity assumptions (weak and Wasserstein) than previously required, and proves convergence of both synchronous and asynchronous Q-learning toward the optimal Q-values of these approximate models. By connecting discretized solutions to the original problem via contraction results and ergodicity conditions, the paper demonstrates that the learned policies are near-optimal for the underlying continuous-space MDP. The combination of finite-state/action approximations, rigorous error bounds, and convergent Q-learning offers a practical framework for planning and learning in average-cost settings with continuous dynamics.

Abstract

For infinite-horizon average-cost criterion problems, there exist relatively few rigorous approximation and reinforcement learning results. In this paper, for Markov Decision Processes (MDPs) with standard Borel spaces, (i) we first provide a discretization-based approximation method for MDPs with continuous spaces under average cost criteria, and provide error bounds for approximations when the dynamics are only weakly continuous (for asymptotic convergence of errors as the grid sizes vanish) or Wasserstein continuous (with a rate in approximation as the grid sizes vanish) under certain ergodicity assumptions. In particular, we relax the total variation condition given in prior work to weak continuity or Wasserstein continuity. (ii) We provide synchronous and asynchronous (quantized) Q-learning algorithms for continuous spaces via quantization (where the quantized state is taken to be the actual state in the corresponding Q-learning algorithms presented in the paper), and establish their convergence. (iii) We finally show that the convergence is to the optimal Q-values of a finite approximate model constructed via quantization, which implies near optimality of the resulting solution. Our Q-learning convergence results and their convergence to near optimality are new for continuous spaces, and the proof method is new even for finite spaces, to our knowledge.
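To make the idea of treating the quantized state as the actual state concrete, here is a minimal sketch of an asynchronous, relative-value-style Q-learning loop on a uniformly quantized state space. It is not the paper's algorithm: the toy dynamics in `step`, the cost, the uniform exploration policy, the step size `1/visits`, and the choice of `Q[0][0]` as the reference entry are all illustrative assumptions.

```python
import random


def quantize(x, n_bins, lo=0.0, hi=1.0):
    """Map a continuous value in [lo, hi] to a bin index in {0, ..., n_bins - 1}."""
    idx = int((x - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)


def step(x, u, rng):
    """Hypothetical toy dynamics on [0, 1] (an assumption, not from the paper)."""
    cost = (x - 0.5) ** 2 + 0.1 * (u - 0.5) ** 2
    x_next = 0.5 * x + 0.5 * u + rng.uniform(-0.05, 0.05)
    return min(max(x_next, 0.0), 1.0), cost


def quantized_q_learning(n_bins=8, actions=(0.0, 0.5, 1.0), n_steps=20000, seed=0):
    """Asynchronous Q-learning along one trajectory, indexed by the quantized state."""
    rng = random.Random(seed)
    Q = [[0.0] * len(actions) for _ in range(n_bins)]
    visits = [[0] * len(actions) for _ in range(n_bins)]
    x = rng.random()
    for _ in range(n_steps):
        i = quantize(x, n_bins)
        a = rng.randrange(len(actions))  # uniform exploration over a finite action set
        x_next, cost = step(x, actions[a], rng)
        j = quantize(x_next, n_bins)
        visits[i][a] += 1
        alpha = 1.0 / visits[i][a]  # per-entry decaying step size
        # Relative target: subtracting a fixed reference entry replaces discounting.
        target = cost + min(Q[j]) - Q[0][0]
        Q[i][a] += alpha * (target - Q[i][a])
        x = x_next
    return Q


if __name__ == "__main__":
    Q = quantized_q_learning()
    # Greedy (cost-minimizing) action index for each state bin.
    policy = [min(range(len(row)), key=row.__getitem__) for row in Q]
    print(policy)
```

Only the bin index is ever used to address the table, so the learner never sees the continuous state directly. Refining `n_bins` trades a larger table for a finer approximation, which is the regime the error bounds in the abstract address.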

Paper Structure (21 sections, 14 theorems, 143 equations, 3 figures, 2 algorithms)

Introduction
Literature Review
Average Cost Optimality Equation and Contraction Properties of Relative Value Iteration
Convergence notions for probability measures and regularity properties of transition kernels
The average cost optimality equation
Contraction via the span semi-norm
Contraction under the sup norm by equivalence with a discounted cost problem
Near Optimality of Quantized State and Action Space Approximations
Finite Action Approximate MDP: Quantization of the Action Space
Finite State Approximate MDP: Quantization of the State Space
Finite MDP Approximation via Wasserstein Continuity with Modulus of Continuity in Approximation
Finite Approximations via Weak Continuity and Asymptotic Optimality
Quantized Q-Learning for Continuous Spaces under Infinite Horizon Average Cost Criterion
Synchronous quantized Q Learning for continuous space average cost MDPs
Asynchronous quantized Q Learning for continuous space average cost MDPs
...and 6 more sections

Key Result

Theorem 2.2

(Verification Theorem [surveyHernandezLermaMCP]) Let $j,h,f$ be a canonical triplet. a) If $j$ is a constant and [...] for all $x$ and under every policy $\gamma$, then the stationary deterministic policy $\gamma^* = \{f,f,f,\cdots\}$ is optimal, so that [...] where [...]. b) If $j$, considered above, is not a constant and depends on $x$, then [...] provided that (condConv01) holds. Furth...
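For orientation, the textbook form of the average cost optimality equation satisfied by a canonical triplet with constant $j$ can be written as below. The symbols $c$ (stage cost), $\mathcal{T}$ (transition kernel), $\mathbb{X}$ (state space), and $\mathbb{U}$ (action space) are notational assumptions not taken from this page, and this is the standard statement rather than a reconstruction of the equations omitted above.

```latex
\begin{align*}
j + h(x) &= \min_{u \in \mathbb{U}} \left( c(x,u) + \int_{\mathbb{X}} h(y)\, \mathcal{T}(dy \mid x, u) \right) \\
         &= c(x, f(x)) + \int_{\mathbb{X}} h(y)\, \mathcal{T}(dy \mid x, f(x)), \qquad x \in \mathbb{X}.
\end{align*}
```

Here $f$ is the selector attaining the minimum, $h$ plays the role of a relative value function, and $j$ is the candidate optimal average cost.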

Figures (3)

Figure 6.1: Relative value function convergence for synchronous (left) and asynchronous (right) algorithms
Figure 6.2: Algorithm convergence under different initial conditions
Figure 6.3: Learned policy performance under different quantization rates

Theorems & Definitions (31)

Example 2.1
Definition 2.1
Theorem 2.2
Theorem 2.3
Remark 2.1
Theorem 2.4
Definition 3.1
Theorem 3.2
Theorem 3.3
proof
...and 21 more
