On the Convergence of Modified Policy Iteration in Risk Sensitive Exponential Cost Markov Decision Processes

Yashaswini Murthy, Mehrdad Moharrami, R. Srikant

TL;DR

This work proves that Modified Policy Iteration converges for risk-sensitive exponential-cost MDPs with finite state and action spaces by leveraging a multiplicative Bellman structure and a carefully designed aperiodicity transformation. The authors introduce a theoretical convergence framework based on ratios of the Bellman update to the current iterate, monotonicity of a maximal ratio, and a contraction arising from products of ergodic transitions, complemented by a normalization to keep iterates bounded. They establish finite-time convergence guarantees and provide simulations showing MPI can outperform standard VI and PI in terms of computational efficiency across varying risk parameters and problem sizes. The results advance RL and DP for robust, risk-averse decision-making and lay groundwork for future extensions to learning in unknown or large-scale settings.

Abstract

Modified policy iteration (MPI) is a dynamic programming algorithm that combines elements of policy iteration and value iteration. The convergence of MPI has been well studied in the context of discounted and average-cost MDPs. In this work, we consider the exponential-cost risk-sensitive MDP formulation, which is known to provide some robustness to model parameters. Although policy iteration and value iteration have been well studied in the context of risk-sensitive MDPs, MPI is unexplored. We provide the first proof that MPI also converges for the risk-sensitive problem in the case of finite state and action spaces. Since the exponential-cost formulation deals with the multiplicative Bellman equation, our main contribution is a convergence proof that is quite different from existing results for discounted and risk-neutral average-cost problems, as well as from risk-sensitive value and policy iteration approaches. We conclude our analysis with simulation results, assessing MPI's performance relative to alternative dynamic programming methods such as value iteration and policy iteration across diverse problem parameters. Our findings highlight risk-sensitive MPI's enhanced computational efficiency compared to both value and policy iteration techniques.
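
To make the multiplicative structure concrete, here is a minimal Python sketch of what an MPI loop for an exponential-cost MDP might look like. It is an illustration under stated assumptions, not the authors' algorithm as specified in the paper: the function names, the max-normalization, the stopping rule on the spread of Bellman ratios, and the nested-list data layout (`P[s][a][j]` for transition probabilities, `c[s][a]` for costs) are all hypothetical choices made for this example, and the sketch assumes every policy induces an irreducible, aperiodic chain.

```python
import math


def q_value(P, c, alpha, V, s, a):
    """Multiplicative one-step cost: exp(alpha * c[s][a]) * sum_j P[s][a][j] * V[j]."""
    return math.exp(alpha * c[s][a]) * sum(p * v for p, v in zip(P[s][a], V))


def greedy_policy(P, c, alpha, V):
    """In each state, pick the action minimizing the multiplicative one-step cost."""
    return [
        min(range(len(P[s])), key=lambda a: q_value(P, c, alpha, V, s, a))
        for s in range(len(P))
    ]


def risk_sensitive_mpi(P, c, alpha, m=5, tol=1e-10, max_iters=10_000):
    """Sketch of modified policy iteration for an exponential-cost MDP.

    Assumptions (not taken from the paper): iterates are rescaled by their
    maximum, and the loop stops when the ratios (T V)(s) / V(s) agree across
    states to within `tol`.
    """
    n = len(P)
    V = [1.0] * n  # any strictly positive starting vector
    for _ in range(max_iters):
        # Policy improvement: greedy with respect to the current iterate.
        policy = greedy_policy(P, c, alpha, V)
        # One application of the greedy policy's operator (equals T V).
        W = [q_value(P, c, alpha, V, s, policy[s]) for s in range(n)]
        # Ratios of the Bellman update to the current iterate.
        ratios = [w / v for w, v in zip(W, V)]
        if max(ratios) - min(ratios) < tol:
            break
        # Partial policy evaluation: m applications of the operator in total.
        for _step in range(m - 1):
            W = [q_value(P, c, alpha, W, s, policy[s]) for s in range(n)]
        # Normalization keeps the iterates bounded.
        scale = max(W)
        V = [w / scale for w in W]
    # At convergence the common ratio is exp(alpha * rate).
    rate = math.log(max(ratios)) / alpha
    return policy, V, rate
```

In this sketch, `m=1` reduces to a normalized value iteration, while a large `m` approximates a full policy-evaluation step, which is the sense in which MPI interpolates between the two methods.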

Paper Structure

This paper contains 20 sections, 7 theorems, 57 equations, 3 figures, and 1 algorithm.

Key Result

Theorem 1

Given $\kappa\in(0,1)$, the following hold:

Figures (3)

  • Figure 1: Convergence performance of value iteration and policy iteration in comparison with modified policy iteration across various risk sensitivity factors.
  • Figure 2: Comparison of convergence time as a function of $m$.
  • Figure 3: Convergence performance of value iteration and policy iteration in comparison with modified policy iteration across different state and action space cardinalities.

Theorems & Definitions (7)

  • Theorem 1
  • Theorem 2
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Theorem 2