Model approximation in MDPs with unbounded per-step cost
Berk Bozkurt, Aditya Mahajan, Ashutosh Nayyar, Yi Ouyang
TL;DR
This work addresses how to evaluate a policy designed on an approximate MDP when the true MDP may incur unbounded per-step costs. It introduces a weighted-norm framework centered on Bellman mismatch functionals to bound the performance gap $\|V^{\hat{\pi}^\star}-V^\star\|_w$, and extends the theory via affine transformations of the cost and integral probability metric (IPM) distances between models. The main contributions are explicit upper bounds (and their variants) that depend on the mismatch between costs and transitions, conditions ensuring DP-solvability under weights, and instantiations on inventory management and LQR examples, where the bounds are tighter than those of traditional sup-norm approaches. The results give guidance for designing approximate models and for assessing policy transfer when costs can be unbounded, with implications for RL and stochastic control.
Abstract
We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov decision process $\mathcal{M}$ when we only have access to an approximate model $\hat{\mathcal{M}}$. How well does an optimal policy $\hat{\pi}^{\star}$ of the approximate model perform when used in the original model $\mathcal{M}$? We answer this question by bounding a weighted norm of the difference between the value function of $\hat{\pi}^{\star}$ when used in $\mathcal{M}$ and the optimal value function of $\mathcal{M}$. We then extend our results and obtain potentially tighter upper bounds by considering affine transformations of the per-step cost. We further provide upper bounds that explicitly depend on the weighted distance between cost functions and weighted distance between transition kernels of the original and approximate models. We present examples to illustrate our results.
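To make the quantity being bounded concrete, the sketch below computes it numerically on a toy problem. It is an illustration only, not the paper's method: it assumes a finite (truncated) state space, reads the weighted norm as the usual weighted sup-norm $\|v\|_w = \max_s |v(s)|/w(s)$, and uses a hypothetical weight function $w(s) = 1 + s^2$. All costs, transition probabilities, and function names are made up for the example.

```python
# Toy illustration (all numbers hypothetical): how well does the optimal
# policy of an approximate model perform in the true model?

def q_values(c, P, V, gamma):
    """One-step lookahead Q(s, a) = c(s, a) + gamma * sum_t P(t | s, a) V(t)."""
    n = len(c)
    return [[c[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
             for a in range(len(c[s]))] for s in range(n)]

def value_iteration(c, P, gamma, iters=2000):
    """Return the (numerically converged) optimal cost-to-go and a greedy policy."""
    V = [0.0] * len(c)
    for _ in range(iters):
        V = [min(q) for q in q_values(c, P, V, gamma)]
    policy = [min(range(len(q)), key=q.__getitem__)
              for q in q_values(c, P, V, gamma)]
    return V, policy

def evaluate_policy(c, P, policy, gamma, iters=2000):
    """Iterative policy evaluation of a deterministic policy."""
    n = len(c)
    V = [0.0] * n
    for _ in range(iters):
        V = [c[s][policy[s]]
             + gamma * sum(P[s][policy[s]][t] * V[t] for t in range(n))
             for s in range(n)]
    return V

def weighted_norm(v, w):
    """Assumed definition: max_s |v(s)| / w(s)."""
    return max(abs(x) / wi for x, wi in zip(v, w))

def performance_gap(c, P, c_hat, P_hat, gamma, w):
    """|| V^{pi_hat*} - V* ||_w, with pi_hat* optimal for the approximate model."""
    V_star, _ = value_iteration(c, P, gamma)
    _, pi_hat = value_iteration(c_hat, P_hat, gamma)
    V_pi_hat = evaluate_policy(c, P, pi_hat, gamma)  # evaluated in the TRUE model
    return weighted_norm([a - b for a, b in zip(V_pi_hat, V_star)], w)

gamma = 0.9
# True model: 3 states, 2 actions; cost grows with the state index.
c = [[0.0, 1.0], [1.0, 2.0], [4.0, 5.0]]
P = [[[0.5, 0.5, 0.0], [1.0, 0.0, 0.0]],
     [[0.0, 0.5, 0.5], [0.8, 0.2, 0.0]],
     [[0.0, 0.0, 1.0], [0.0, 0.8, 0.2]]]
# Approximate model: perturbed costs and transitions.
c_hat = [[0.0, 1.5], [1.2, 2.5], [3.5, 5.5]]
P_hat = [[[0.6, 0.4, 0.0], [1.0, 0.0, 0.0]],
         [[0.0, 0.6, 0.4], [0.7, 0.3, 0.0]],
         [[0.0, 0.1, 0.9], [0.0, 0.7, 0.3]]]
w = [1.0 + s * s for s in range(3)]  # hypothetical weight function

gap = performance_gap(c, P, c_hat, P_hat, gamma, w)
print("weighted performance gap:", gap)
```

On a small problem like this the gap can be computed exactly. Upper bounds matter when the true model cannot be solved directly and only the mismatch between the two models can be quantified.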
