Model approximation in MDPs with unbounded per-step cost

Berk Bozkurt, Aditya Mahajan, Ashutosh Nayyar, Yi Ouyang

TL;DR

This work addresses the challenge of evaluating policies learned on an approximate MDP when the true MDP may incur unbounded per-step costs. It introduces a weighted-norm framework centered on Bellman mismatch functionals to bound the performance gap $\|V^{\hat{\pi}^\star}-V^\star\|_w$ and extends the theory via affine-cost transformations and integral probability metric (IPM) distances between models. The main contributions include explicit upper bounds (and their variants) that depend on mismatch between costs and transitions, conditions ensuring DP-solvability under weights, and practical instantiations through inventory management and LQR examples, showing tighter bounds than traditional sup-norm approaches. The results yield actionable guidance for designing approximate models and for assessing policy transfer when costs can be unbounded, with implications for RL and stochastic control under unbounded cost regimes.

Abstract

We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process $\mathcal{M}$ when we only have access to an approximate model $\hat{\mathcal{M}}$. How well does an optimal policy $\hat{\pi}^{\star}$ of the approximate model perform when used in the original model $\mathcal{M}$? We answer this question by bounding a weighted norm of the difference between the value function of $\hat{\pi}^{\star}$ when used in $\mathcal{M}$ and the optimal value function of $\mathcal{M}$. We then extend our results and obtain potentially tighter upper bounds by considering affine transformations of the per-step cost. We further provide upper bounds that explicitly depend on the weighted distance between cost functions and weighted distance between transition kernels of the original and approximate models. We present examples to illustrate our results.
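
As a rough numerical illustration of the question posed in the abstract, the sketch below builds a small finite MDP, perturbs it to get an approximate model, computes an optimal policy of the approximate model, and evaluates that policy in the original model. It assumes the commonly used weighted sup-norm $\|v\|_w = \sup_s |v(s)|/w(s)$; the paper's own definition is not reproduced here, so treat that form as an assumption. The toy models, the quadratic cost on a truncated state space (a stand-in for an unbounded cost), the weight $w(s) = 1 + s^2$, and all function names are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

def value_iteration(P, c, gamma, tol=1e-10, max_iter=10_000):
    """Cost-minimizing value iteration.
    P: (A, S, S) transition kernels, c: (S, A) per-step costs."""
    V = np.zeros(c.shape[0])
    for _ in range(max_iter):
        # Q[s, a] = c(s, a) + gamma * sum_t P(t | s, a) V(t)
        Q = c + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.min(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = c + gamma * np.einsum("ast,t->sa", P, V)
    return V, Q.argmin(axis=1)

def evaluate_policy(P, c, gamma, policy):
    """Exact value of a deterministic policy: solve (I - gamma P_pi) V = c_pi."""
    S = c.shape[0]
    idx = np.arange(S)
    P_pi = P[policy, idx, :]  # row s is P(. | s, policy[s])
    c_pi = c[idx, policy]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)

def weighted_norm(v, w):
    """Assumed form of the weighted norm: sup_s |v(s)| / w(s)."""
    return np.max(np.abs(v) / w)

# Hypothetical toy models (not from the paper).
rng = np.random.default_rng(0)
S, A, gamma = 6, 2, 0.9
states = np.arange(S)
c = states[:, None] ** 2 + np.array([0.0, 1.0])[None, :]  # cost grows with the state
P = rng.dirichlet(np.ones(S), size=(A, S))                 # original model
P_hat = 0.9 * P + 0.1 * rng.dirichlet(np.ones(S), size=(A, S))  # approximate kernel
c_hat = c + 0.1 * rng.standard_normal(c.shape)             # approximate cost

V_star, _ = value_iteration(P, c, gamma)                   # optimal value in the original model
_, pi_hat = value_iteration(P_hat, c_hat, gamma)           # optimal policy of the approximate model
V_pi_hat = evaluate_policy(P, c, gamma, pi_hat)            # that policy, run in the original model

gap = V_pi_hat - V_star
w = 1.0 + states.astype(float) ** 2                        # hypothetical weight function
print("weighted-norm gap:", weighted_norm(gap, w))
print("sup-norm gap:     ", weighted_norm(gap, np.ones(S)))
```

The sketch only measures the gap empirically on one toy instance; it does not compute any of the paper's upper bounds. Because the weight here satisfies $w(s) \ge 1$, the weighted-norm gap can never exceed the sup-norm gap, which gives some intuition for why a weighted norm is a natural yardstick when costs grow with the state.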

Paper Structure (44 sections, 15 theorems, 88 equations, 4 figures, 1 table)

This paper contains 44 sections, 15 theorems, 88 equations, 4 figures, 1 table.

Key Result

lemma 1

Given an MDP $\mathcal{M}$ and a tuple $(\kappa, w)$, for any policy $\pi \in \Pi_S(\kappa, w)$, we have the following:

Figures (4)

  • Figure 1: Comparison of the bounds on $V^\star(s)$ based on weighted-norm and sup-norm.
  • Figure 3: Lower bounds obtained by different weight functions. Note that the curve corresponding to $\ell = 0$ is not visible in the zoomed-in plot (b).
  • Figure 4: Lower bounds obtained by different choices of $\boldsymbol{\alpha}$.
  • Figure 5: Lower bounds obtained using stability of the model. Note that the curves corresponding to $\ell = 1.50 \cdot 10^{-4}$ and $\ell = 1.75 \cdot 10^{-4}$ are not visible in either plot.

Theorems & Definitions (39)

  • definition 1: Weighted norm
  • remark 1
  • definition 2: Bellman operators
  • definition 3: One-step greedy policy
  • definition 4: Dynamic programming solvability
  • definition 5: $(\kappa, w)$ stability of a policy
  • remark 2
  • definition 6: $(\bar{\kappa}, \bar{w})$ stability of the model
  • remark 3
  • lemma 1
  • ...and 29 more