Q-Learning for Continuous State and Action MDPs under Average Cost Criteria
Ali Devran Kara, Serdar Yuksel
TL;DR
This work tackles the challenge of solving infinite-horizon average-cost MDPs with continuous state and action spaces by developing discretization-based approximations and quantized Q-learning methods. It establishes near-optimality guarantees for finite, quantized models under weaker continuity assumptions (weak and Wasserstein) than previously required, and proves convergence of both synchronous and asynchronous Q-learning toward the optimal Q-values of these approximate models. By connecting discretized solutions to the original problem via contraction results and ergodicity conditions, the paper demonstrates that the learned policies are near-optimal for the underlying continuous-space MDP. The combination of finite-state/action approximations, rigorous error bounds, and convergent Q-learning offers a practical framework for planning and learning in average-cost settings with continuous dynamics.
Abstract
Relatively few rigorous approximation and reinforcement learning results exist for infinite-horizon average-cost problems. In this paper, for Markov Decision Processes (MDPs) with standard Borel spaces: (i) We provide a discretization-based approximation method for MDPs with continuous spaces under the average cost criterion, together with error bounds under certain ergodicity assumptions. When the dynamics are only weakly continuous, the approximation error converges to zero asymptotically as the grid sizes vanish; when the dynamics are Wasserstein continuous, we obtain a rate of convergence as the grid sizes vanish. In particular, we relax the total variation condition given in prior work to weak continuity or Wasserstein continuity. (ii) We provide synchronous and asynchronous quantized Q-learning algorithms for continuous spaces, in which the quantized state is treated as the actual state, and establish their convergence. (iii) Finally, we show that the algorithms converge to the optimal Q-values of a finite approximate model constructed via quantization, which implies near optimality of the resulting solution. To our knowledge, the Q-learning convergence results and the near optimality of their limits are new for continuous spaces, and the proof method is new even for finite spaces.
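To make the idea in (ii) concrete, the following is a minimal sketch of asynchronous Q-learning on a quantized model, in which the learner sees only the bin index of the continuous state. It is not the algorithm from the paper: the update rule used here is the standard relative value iteration (RVI) form of average-cost Q-learning, swapped in for illustration, and the toy dynamics, cost function, uniform grid, uniform exploration policy, and all function names are hypothetical assumptions.

```python
import random


def quantize(x, low, high, n_bins):
    """Map x (clipped to [low, high]) to a uniform bin index in 0..n_bins-1."""
    x = min(max(x, low), high)
    idx = int((x - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)


def bin_center(idx, low, high, n_bins):
    """Representative point of a uniform bin."""
    width = (high - low) / n_bins
    return low + (idx + 0.5) * width


def step(x, u, rng):
    """Hypothetical toy dynamics and stage cost on [0, 1] (assumption, not from the paper)."""
    noise = rng.uniform(-0.1, 0.1)
    x_next = min(max(0.5 * x + 0.5 * u + noise, 0.0), 1.0)
    cost = (x - 0.5) ** 2 + 0.1 * u ** 2
    return x_next, cost


def quantized_rvi_q_learning(n_state_bins=10, n_action_bins=5,
                             n_steps=20000, seed=0):
    """Asynchronous RVI-style Q-learning where the quantized state is treated as the state."""
    rng = random.Random(seed)
    Q = [[0.0] * n_action_bins for _ in range(n_state_bins)]
    visits = [[0] * n_action_bins for _ in range(n_state_bins)]
    x = rng.random()
    for _ in range(n_steps):
        i = quantize(x, 0.0, 1.0, n_state_bins)
        a = rng.randrange(n_action_bins)           # uniform exploration (assumption)
        u = bin_center(a, 0.0, 1.0, n_action_bins)
        x_next, cost = step(x, u, rng)
        j = quantize(x_next, 0.0, 1.0, n_state_bins)
        visits[i][a] += 1
        alpha = 1.0 / visits[i][a]                 # per-pair decaying step size
        reference = min(Q[0])                      # RVI offset, taken at state bin 0
        target = cost + min(Q[j]) - reference
        Q[i][a] += alpha * (target - Q[i][a])      # update only the visited pair
        x = x_next
    # Greedy (cost-minimizing) action index for each state bin.
    policy = [min(range(n_action_bins), key=lambda b, i=i: Q[i][b])
              for i in range(n_state_bins)]
    return Q, policy
```

Calling quantized_rvi_q_learning() returns the learned Q-table over (state bin, action bin) pairs and a greedy policy on the grid. Applying that policy to the continuous system amounts to quantizing the observed state and playing the center of the selected action bin. Refining the grid corresponds to the regime in which the approximation error bounds in (i) apply.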
