Q-Learning for Continuous State and Action MDPs under Average Cost Criteria

Ali Devran Kara, Serdar Yuksel

TL;DR

This work tackles the challenge of solving infinite-horizon average-cost MDPs with continuous state and action spaces by developing discretization-based approximations and quantized Q-learning methods. It establishes near-optimality guarantees for finite, quantized models under weaker continuity assumptions (weak and Wasserstein) than previously required, and proves convergence of both synchronous and asynchronous Q-learning toward the optimal Q-values of these approximate models. By connecting discretized solutions to the original problem via contraction results and ergodicity conditions, the paper demonstrates that the learned policies are near-optimal for the underlying continuous-space MDP. The combination of finite-state/action approximations, rigorous error bounds, and convergent Q-learning offers a practical framework for planning and learning in average-cost settings with continuous dynamics.

Abstract

For infinite-horizon average-cost criterion problems, there exist relatively few rigorous approximation and reinforcement learning results. In this paper, for Markov Decision Processes (MDPs) with standard Borel spaces, (i) we first provide a discretization based approximation method for MDPs with continuous spaces under average cost criteria, and provide error bounds for approximations when the dynamics are only weakly continuous (for asymptotic convergence of errors as the grid sizes vanish) or Wasserstein continuous (with a rate in approximation as the grid sizes vanish) under certain ergodicity assumptions. In particular, we relax the total variation condition given in prior work to weak continuity or Wasserstein continuity. (ii) We provide synchronous and asynchronous (quantized) Q-learning algorithms for continuous spaces via quantization (where the quantized state is taken to be the actual state in corresponding Q-learning algorithms presented in the paper), and establish their convergence. (iii) We finally show that the convergence is to the optimal Q values of a finite approximate model constructed via quantization, which implies near optimality of the arrived solution. Our Q-learning convergence results and their convergence to near optimality are new for continuous spaces, and the proof method is new even for finite spaces, to our knowledge.
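To make item (ii) concrete, here is a minimal sketch of asynchronous quantized Q-learning under an average-cost criterion. It assumes a relative-value style update that subtracts a fixed reference entry `Q[0][0]`, which is a common way to keep average-cost iterates bounded and may differ from the exact update analyzed in the paper. The toy dynamics, the stage cost, the uniform exploration, the `1/visits` step size, and all function names are hypothetical choices made only for illustration. The point it shows is the one stated in the abstract: the learner only ever sees the quantized state (a bin index) and treats it as if it were the actual state.

```python
import random


def quantize(x, n_bins):
    """Map x in [0, 1] to the index of its uniform bin (0 .. n_bins - 1)."""
    return min(int(x * n_bins), n_bins - 1)


def step(x, u, rng):
    """Hypothetical toy dynamics on [0, 1]: contraction plus control plus noise."""
    nxt = 0.8 * x + u + rng.uniform(-0.1, 0.1)
    return min(max(nxt, 0.0), 1.0)


def cost(x, u):
    """Hypothetical stage cost: distance from 0.5 plus a small control penalty."""
    return (x - 0.5) ** 2 + 0.1 * u * u


def quantized_rvi_q_learning(n_bins=8, actions=(0.0, 0.1, 0.2),
                             n_steps=50_000, seed=0):
    """Asynchronous relative Q-learning on the quantized state (assumed form)."""
    rng = random.Random(seed)
    q = [[0.0] * len(actions) for _ in range(n_bins)]
    visits = [[0] * len(actions) for _ in range(n_bins)]
    x = rng.random()
    for _ in range(n_steps):
        i = quantize(x, n_bins)            # the learner only uses the bin index
        a = rng.randrange(len(actions))    # uniform exploration over the action grid
        x_next = step(x, actions[a], rng)  # the true state still evolves continuously
        j = quantize(x_next, n_bins)
        visits[i][a] += 1
        alpha = 1.0 / visits[i][a]         # per-pair decreasing step size
        # Relative update: subtract the reference entry q[0][0] (an assumption).
        target = cost(x, actions[a]) + min(q[j]) - q[0][0]
        q[i][a] += alpha * (target - q[i][a])
        x = x_next
    # Greedy policy on the quantized model: index of the minimizing action per bin.
    policy = [min(range(len(actions)), key=lambda k: row[k]) for row in q]
    return q, policy
```

Calling `quantized_rvi_q_learning()` returns the learned table and a greedy action index per bin. A synchronous variant would instead update every (bin, action) pair at each iteration using a simulated transition for each pair.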

Paper Structure (21 sections, 14 theorems, 143 equations, 3 figures, 2 algorithms)

Key Result

Theorem 2.2

(Verification Theorem) [surveyHernandezLermaMCP] Let $j,h,f$ be a canonical triplet. a) If $j$ is a constant and […] for all $x$ and under every policy $\gamma$, then the stationary deterministic policy $\gamma^* = \{f,f,f,\cdots\}$ is optimal so that […] where […]. b) If $j$, considered above, is not a constant and depends on $x$, then […] provided that (condConv01) holds. Furth…
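For context, a canonical triplet is commonly defined through the pair of equations below. The block is a sketch in the standard textbook form rather than a quotation from the paper. The symbols $\mathbb{X}$ (state space), $\mathbb{U}$ (action space), $c$ (stage cost), and $\mathcal{T}$ (transition kernel) are assumptions about notation. When $j$ is constant, the first equation holds trivially and the second is the average cost optimality equation.

```latex
\begin{align*}
j(x) &= \inf_{u \in \mathbb{U}} \int_{\mathbb{X}} j(y)\, \mathcal{T}(dy \mid x, u), \\
j(x) + h(x) &= \inf_{u \in \mathbb{U}} \left( c(x,u) + \int_{\mathbb{X}} h(y)\, \mathcal{T}(dy \mid x, u) \right),
\end{align*}
% with the selector f(x) attaining the infimum in both equations for every x.
```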

Figures (3)

  • Figure 6.1: Relative value function convergence for synchronous (left) and asynchronous (right) algorithms
  • Figure 6.2: Algorithm convergence under different initial conditions
  • Figure 6.3: Learned policy performance under different quantization rates

Theorems & Definitions (31)

  • Example 2.1
  • Definition 2.1
  • Theorem 2.2
  • Theorem 2.3
  • Remark 2.1
  • Theorem 2.4
  • Definition 3.1
  • Theorem 3.2
  • Theorem 3.3
  • proof
  • ...and 21 more