
Quantizer Design for Finite Model Approximations, Model Learning, and Quantized Q-Learning for MDPs with Unbounded Spaces

Osman Bicer, Ali D. Kara, Serdar Yuksel

TL;DR

This work addresses finite-model approximations for Markov decision processes (MDPs) with unbounded state spaces by optimizing state quantizers. It derives refined upper bounds on the error between the true optimal value function and that of the quantized model, expressing the bounds through stationary policies and their occupation/invariant measures so that rates can be computed. A Foster–Lyapunov framework yields explicit decay rates of the quantization error as the number of bins $M$ grows, for both the discounted and the average-cost criteria. These results extend to quantized Q-learning and empirical model learning, where the weighting measures are governed by invariant distributions under exploration. Together, the contributions give a unified, rate-based treatment of planning and learning in continuous-space MDPs and practical quantizer-design guidance that ensures near-optimal performance even on non-compact spaces, with explicit convergence guarantees.
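The claim that the quantization error decays as the number of bins $M$ grows can be illustrated with a minimal sketch. All concrete choices below are assumptions for illustration, not the paper's design: a scalar state on the real line, $M-2$ uniform bins on a truncated region $[-K, K]$ plus two overflow bins, the loss $L(x) = |x - q(x)|$, a standard normal weighting measure standing in for an invariant measure, and the hypothetical truncation rule $K(M) = \sqrt{2\log M}$, which grows $K$ with $M$ so that both the bin width and the mass outside $[-K, K]$ shrink. The names `make_quantizer` and `weighted_error` are illustrative.

```python
import numpy as np


def make_quantizer(M, K):
    """Hypothetical M-bin quantizer for the real line (illustrative, not the paper's design).

    M - 2 uniform bins cover [-K, K) and use their midpoints as representatives;
    two overflow bins cover (-inf, -K) and [K, inf) with representatives -K and K.
    """
    if M < 3:
        raise ValueError("need M >= 3: at least one inner bin plus two overflow bins")
    edges = np.linspace(-K, K, M - 1)          # M - 1 edges -> M - 2 inner bins
    mids = 0.5 * (edges[:-1] + edges[1:])
    reps = np.concatenate(([-K], mids, [K]))   # bins 0 and M - 1 are the overflow bins

    def quantize(x):
        # searchsorted returns 0 for x < -K, 1..M-2 inside [-K, K), and M-1 for x >= K
        idx = np.searchsorted(edges, x, side="right")
        return reps[idx]

    return quantize


def weighted_error(M, samples):
    """Monte Carlo estimate of E_nu[L(X)] with the assumed loss L(x) = |x - q(x)|."""
    K = np.sqrt(2.0 * np.log(M))               # hypothetical truncation rule K(M)
    q = make_quantizer(M, K)
    return np.mean(np.abs(samples - q(samples)))


# Weighting measure nu: standard normal samples, standing in for an invariant measure.
rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)
errors = {M: weighted_error(M, samples) for M in (10, 50, 200)}
for M, err in errors.items():
    print(f"M={M:4d}  E[L(X)] ~ {err:.4f}")
```

The growing truncation level is the design choice that matters on an unbounded space in this sketch: a fixed $K$ leaves a tail contribution that does not vanish as $M$ grows, while a $K$ that grows too quickly widens the inner bins.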

Abstract

In this paper, for Markov decision processes (MDPs) with unbounded state spaces, we refine the upper bounds of [Kara et al. JMLR'23] on finite model approximation errors by optimizing the quantizers used for finite model approximations. We also consider the implications for quantizer design in quantized Q-learning and empirical model learning, and the performance of policies obtained via Q-learning where the quantized state is treated as the state itself. We highlight the distinctions between planning, where approximating MDPs can be designed independently, and learning (either via Q-learning or empirical model learning), where approximating MDPs are restricted to be defined by invariant measures of Markov chains under exploration policies. This restriction leads to significant subtleties in quantizer design performance, even though asymptotic near optimality can be established under both setups. In particular, under Lyapunov growth conditions, we obtain explicit upper bounds which decay to zero as the number of bins approaches infinity.
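Q-learning "where the quantized state is treated as the state itself" can be sketched as tabular Q-learning on bin indices. Everything concrete here is a hypothetical choice for illustration: the toy dynamics $x_{t+1} = 0.8x_t + u_t + 0.5w_t$ with standard normal $w_t$, the action set $\{-1, 0, 1\}$, a bounded stage cost, a uniform random exploration policy, and step sizes $1/N(i,u)$ taken from visit counts. The learner only observes bin indices, so what it converges to depends on how the exploration chain's invariant measure weights the states inside each bin, which is the learning-side restriction described above.

```python
import numpy as np


def quantized_q_learning(M=12, K=4.0, beta=0.9, steps=50_000, seed=0):
    """Tabular Q-learning on bin indices of a toy scalar system (all modeling choices hypothetical)."""
    rng = np.random.default_rng(seed)
    actions = np.array([-1.0, 0.0, 1.0])
    edges = np.linspace(-K, K, M - 1)          # M - 2 inner bins plus two overflow bins
    Q = np.zeros((M, actions.size))
    visits = np.zeros((M, actions.size))
    x = 0.0
    for _ in range(steps):
        i = int(np.searchsorted(edges, x, side="right"))  # quantized state used as the state
        a = int(rng.integers(actions.size))               # uniform random exploration policy
        u = actions[a]
        cost = min(x * x, 10.0) + 0.1 * abs(u)            # bounded stage cost (assumed)
        x = 0.8 * x + u + 0.5 * rng.standard_normal()     # toy linear dynamics (assumed)
        j = int(np.searchsorted(edges, x, side="right"))
        visits[i, a] += 1
        alpha = 1.0 / visits[i, a]                        # step size from visit counts
        # Cost-minimizing Q-learning update on the finite (bin, action) table.
        Q[i, a] += alpha * (cost + beta * Q[j].min() - Q[i, a])
    greedy_policy = actions[Q.argmin(axis=1)]            # one greedy action per bin
    return Q, greedy_policy, visits


Q, policy, visits = quantized_q_learning()
print(policy)
```

Here `visits.sum(axis=1) / visits.sum()` is an empirical estimate of how often the exploration chain occupies each bin, i.e. of the bin masses of its invariant measure.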

Paper Structure

This paper contains 21 sections, 11 theorems, 122 equations, and 1 algorithm.

Key Result

Theorem 2.1

Under the model assumptions and the Lipschitz assumptions on the cost and transitions, for any initial state $x_0 \in \mathds{X}$, the error between the optimal value function $J^*_{\beta}(x_0)$ and the approximate value function $\hat{J}_{\beta}(x_0)$ satisfies: … where $\Gamma$ is the set of admissible policies and $L(X_t)$ is the loss function defined in eq:lo
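Theorem 2.1 names the admissible policies $\Gamma$ and the loss $L(X_t)$ along the state process, and the refined bounds are described as being expressed through stationary policies and their occupation measures. One quantity of this kind is the discounted sum $\sum_{t \ge 0} \beta^t E^{\gamma}_{x_0}[L(X_t)]$ for a single stationary policy $\gamma$; the exact inequality and the paper's definition of $L$ are not reproduced here. The sketch below estimates that sum by Monte Carlo, assuming $L(x) = |x - q(x)|$ for a uniform quantizer $q$ on $[-K, K]$ with clipped tails, the toy dynamics $x_{t+1} = 0.8x_t + u_t + 0.5w_t$, a truncation horizon $T$, and a hypothetical policy `push_to_origin`.

```python
import numpy as np


def quantize(x, M, K):
    """Uniform M-bin quantizer on [-K, K]; points outside are clipped into the end bins."""
    w = 2.0 * K / M
    xc = np.clip(x, -K, K)
    idx = np.minimum(((xc + K) / w).astype(int), M - 1)
    return -K + (idx + 0.5) * w                # bin midpoints as representatives


def discounted_loss(policy, x0, M, K=4.0, beta=0.9, T=150, n_paths=4000, seed=0):
    """Monte Carlo estimate of sum_{t<T} beta^t E[L(X_t)] under a stationary policy.

    Assumes L(x) = |x - q(x)| and the toy dynamics x' = 0.8 x + u + 0.5 w, w ~ N(0, 1).
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, float(x0))
    total = 0.0
    for t in range(T):
        total += beta**t * np.mean(np.abs(x - quantize(x, M, K)))
        x = 0.8 * x + policy(x) + 0.5 * rng.standard_normal(n_paths)
    return total


def push_to_origin(x):
    # Hypothetical stationary policy: action in {-1, 0, 1} pushing the state toward 0.
    return -np.clip(np.round(x), -1.0, 1.0)


coarse = discounted_loss(push_to_origin, x0=2.0, M=10)
fine = discounted_loss(push_to_origin, x0=2.0, M=80)
print(f"M=10: {coarse:.3f}   M=80: {fine:.3f}")
```

Because the policy does not depend on the quantizer and the seed is fixed, both calls simulate the same trajectories, so the difference between the two estimates isolates the effect of the bin width.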

Theorems & Definitions (26)

  • Remark 2.1
  • Theorem 2.1 (Kara et al., 2023, Theorem 3)
  • Theorem 2.2
  • proof
  • Theorem 2.3
  • proof
  • Lemma 2.4
  • Corollary 2.5: to Theorem \ref{theorem:optimized_error_bound} and Theorem \ref{theorem:higher_dimensions}
  • Theorem 2.6
  • proof
  • ...and 16 more