Model approximation in MDPs with unbounded per-step cost

Berk Bozkurt; Aditya Mahajan; Ashutosh Nayyar; Yi Ouyang

TL;DR

This work addresses the challenge of evaluating policies learned on an approximate MDP when the true MDP may incur unbounded per-step costs. It introduces a weighted-norm framework centered on Bellman mismatch functionals to bound the performance gap $\|V^{\hat{\pi}^\star}-V^\star\|_w$ and extends the theory via affine-cost transformations and integral probability metric (IPM) distances between models. The main contributions include explicit upper bounds (and their variants) that depend on mismatch between costs and transitions, conditions ensuring DP-solvability under weights, and practical instantiations through inventory management and LQR examples, showing tighter bounds than traditional sup-norm approaches. The results yield actionable guidance for designing approximate models and for assessing policy transfer when costs can be unbounded, with implications for RL and stochastic control under unbounded cost regimes.

Abstract

We consider the problem of designing a control policy for an infinite-horizon discounted-cost Markov decision process $\mathcal{M}$ when we only have access to an approximate model $\hat{\mathcal{M}}$. How well does an optimal policy $\hat{\pi}^{\star}$ of the approximate model perform when used in the original model $\mathcal{M}$? We answer this question by bounding a weighted norm of the difference between the value function of $\hat{\pi}^{\star}$ when used in $\mathcal{M}$ and the optimal value function of $\mathcal{M}$. We then extend our results and obtain potentially tighter upper bounds by considering affine transformations of the per-step cost. We further provide upper bounds that explicitly depend on the weighted distance between cost functions and weighted distance between transition kernels of the original and approximate models. We present examples to illustrate our results.
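To make the quantity in question concrete, the following is a minimal numerical sketch, not the paper's method or bounds. It computes $\|V^{\hat{\pi}^\star}-V^\star\|_w$ on a small finite MDP, assuming the usual weighted norm $\|v\|_w = \sup_s |v(s)|/w(s)$. The toy models, the weight function `w`, and all function names are hypothetical choices made for illustration. A finite state space necessarily has bounded cost, so this only mimics the unbounded setting by letting the cost grow quadratically with the state.

```python
import numpy as np

def value_iteration(P, c, gamma, tol=1e-10, max_iter=100_000):
    """P: (A, S, S) transition kernels, c: (S, A) per-step costs.
    Returns an approximately optimal value function and a greedy policy."""
    V = np.zeros(c.shape[0])
    for _ in range(max_iter):
        # Q[s, a] = c[s, a] + gamma * sum_t P[a, s, t] * V[t]
        Q = c + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.min(axis=1)
        done = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if done:
            break
    Q = c + gamma * np.einsum("ast,t->sa", P, V)
    return V, Q.argmin(axis=1)

def policy_value(P, c, gamma, policy):
    """Exact value of a deterministic stationary policy: solve (I - gamma P_pi) V = c_pi."""
    S = c.shape[0]
    idx = np.arange(S)
    P_pi = P[policy, idx, :]   # row s is the kernel of action policy[s] in state s
    c_pi = c[idx, policy]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)

def weighted_norm(v, w):
    """Assumed definition: sup over states of |v(s)| / w(s)."""
    return np.max(np.abs(v) / w)

# Hypothetical toy models (illustrative only).
rng = np.random.default_rng(0)
S, A, gamma = 20, 2, 0.9
states = np.arange(S)
c = states[:, None] ** 2 + np.array([0.0, 5.0])[None, :]  # cost grows with the state
P = rng.dirichlet(np.ones(S), size=(A, S))                # true kernels, shape (A, S, S)

# Approximate model: perturbed kernels and costs.
P_hat = 0.9 * P + 0.1 * rng.dirichlet(np.ones(S), size=(A, S))
c_hat = c + rng.normal(scale=0.5, size=c.shape)

w = 1.0 + states ** 2                                     # weight growing like the cost

V_star, pi_star = value_iteration(P, c, gamma)            # optimal in the true model
_, pi_hat = value_iteration(P_hat, c_hat, gamma)          # optimal in the approximate model
V_pi_hat = policy_value(P, c, gamma, pi_hat)              # pi_hat evaluated in the true model

gap_w = weighted_norm(V_pi_hat - V_star, w)
gap_sup = np.max(np.abs(V_pi_hat - V_star))
print(f"weighted-norm gap: {gap_w:.3e}, sup-norm gap: {gap_sup:.3e}")
```

Because `w` is at least 1 everywhere in this sketch, the weighted-norm gap can never exceed the sup-norm gap. With a weight that grows like the cost, the weighted norm remains meaningful in settings where the sup norm of the value function would be infinite.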

Paper Structure (44 sections, 15 theorems, 88 equations, 4 figures, 1 table)

Introduction
Preliminaries
Markov decision processes
Dynamic programming solvability
Weighted-norm stability
Problem formulation and approximation bounds
Model approximation in MDPs
Approximation bounds
Discussion
Bounds under stability of deterministic open loop policies
Generalized bounds based on affine transformations of the cost
Some instances of the main results
Inventory management
Initial State dependent weight function
Generalized bounds based on cost transformation
...and 29 more sections

Key Result

lemma 1

Given an MDP $\mathcal{M}$ and a tuple $(\kappa, w)$, for any policy $\pi \in \Pi_S(\kappa, w)$, we have the following:

Figures (4)

Figure 1: Comparison of the bounds on $V^\star(s)$ based on weighted-norm and sup-norm.
Figure 3: Lower bounds obtained by different weight functions. Note that the curve corresponding to $\ell = 0$ is not visible in the zoomed-in plot (b).
Figure 4: Lower bounds obtained by different choices of $\boldsymbol{\alpha}$.
Figure 5: Lower bounds obtained using stability of the model. Note that the curves corresponding to $\ell = 1.50 \cdot 10^{-4}$ and $\ell = 1.75\cdot 10^{-4}$ are not visible in either plot.

Theorems & Definitions (39)

definition 1: Weighted norm
remark 1
definition 2: Bellman operators
definition 3: One-step greedy policy
definition 4: Dynamic programming solvability
definition 5: $(\kappa, w)$ stability of a policy
remark 2
definition 6: $(\bar{\kappa}, \bar{w})$ stability of the model
remark 3
lemma 1
...and 29 more
