Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form

Toshinori Kitamura, Tadashi Kozuno, Wataru Kumagai, Kenta Hoshino, Yohei Hosoe, Kazumi Kasaura, Masashi Hamaya, Paavo Parmas, Yutaka Matsuo

TL;DR

This work tackles the problem of identifying near-optimal policies for robust constrained MDPs (RCMDPs), where a policy must minimize worst-case cost while respecting constraints under uncertainty. It introduces the epigraph form to decouple gradient signals and presents the Epigraph Robust Constrained Policy Gradient Search (EpiRC-PGS), a double-loop algorithm combining outer bisection on the objective threshold with an inner policy-gradient optimizer that uses a subgradient of the epigraph objective. The authors prove that the method achieves an $\varepsilon$-optimal policy with a complexity of $\tilde{\mathcal{O}}(\varepsilon^{-4})$ robust policy evaluations and demonstrate superiority over Lagrangian-based baselines in synthetic RCMDP experiments. The approach broadens the practical reliability of RCMDP solutions and offers a theoretically grounded pathway toward safe RL under model uncertainty.

Abstract

Designing a safe policy for uncertain environments is crucial in real-world control systems. However, this challenge remains inadequately addressed within the Markov decision process (MDP) framework. This paper presents the first algorithm guaranteed to identify a near-optimal policy in a robust constrained MDP (RCMDP), where an optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments. We first prove that the conventional policy gradient approach to the Lagrangian max-min formulation can become trapped in suboptimal solutions. This occurs when its inner minimization encounters a sum of conflicting gradients from the objective and constraint functions. To address this, we leverage the epigraph form of the RCMDP problem, which resolves the conflict by selecting a single gradient from either the objective or the constraints. Building on the epigraph form, we propose a bisection search algorithm with a policy gradient subroutine and prove that it identifies an $\varepsilon$-optimal policy in an RCMDP with $\tilde{\mathcal{O}}(\varepsilon^{-4})$ robust policy evaluations.
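
The bisection idea in the abstract can be illustrated with a minimal sketch. It assumes a single constraint and a finite set of candidate policies whose worst-case costs are already known, and it replaces the policy-gradient subroutine with brute-force enumeration, so it is not the paper's algorithm. The names `epigraph_gap` and `bisection_search` and all numbers are hypothetical. The sketch relies on the fact that $\min_\pi \max\{J_0(\pi) - b_0,\; J_1(\pi) - b_1\} \le 0$ holds exactly when some feasible policy has objective cost at most $b_0$.

```python
import numpy as np

def epigraph_gap(b0, J0, J1, b1):
    """min over candidates of max{J0 - b0, J1 - b1}.

    Brute-force enumeration stands in for the policy-gradient subroutine.
    """
    return float(np.min(np.maximum(J0 - b0, J1 - b1)))

def bisection_search(J0, J1, b1, lo, hi, tol=1e-6):
    """Shrink [lo, hi] around the smallest b0 with epigraph_gap(b0) <= 0."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if epigraph_gap(mid, J0, J1, b1) <= 0.0:
            hi = mid  # some feasible candidate has objective cost <= mid
        else:
            lo = mid  # no feasible candidate reaches objective cost mid
    return hi

# Hypothetical worst-case costs of four candidate policies (made-up numbers).
J0 = np.array([1.0, 2.0, 3.0, 0.5])  # robust objective cost per candidate
J1 = np.array([0.9, 0.4, 0.1, 1.5])  # robust constraint cost per candidate
b1 = 0.5                             # constraint threshold

b0_star = bisection_search(J0, J1, b1, lo=0.0, hi=4.0)
best = int(np.argmin(np.maximum(J0 - b0_star, J1 - b1)))
print(b0_star, best)
```

In this toy instance the cheapest candidate (index 3) violates the constraint, so the search settles on the cheapest feasible one instead. Each inner step uses either the objective term or the constraint term through the max, never their weighted sum.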

Paper Structure (59 sections, 29 theorems, 125 equations, 3 figures, 1 table, 5 algorithms)

Key Result

Lemma 1

For any $\pi \in \Pi$, transition kernel $P:\mathcal{S} \times \mathcal{A} \to \mathscr{P}(\mathcal{S})$, and cost $c \in \mathbb{R}^{SA}$, the gradient is given by $\left(\nabla J_{c, P}(\pi)\right)(s, a) = H d^{\pi}_{P}(s) Q^{\pi}_{c, P}(s, a) \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A}\;.$
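
The formula in Lemma 1 can be checked numerically on a small random MDP. The sketch below is only an illustration under assumptions not stated in this summary: $J_{c,P}(\pi)$ is the discounted cumulative cost from an initial distribution `mu`, $H = 1/(1-\gamma)$, and $d^{\pi}_{P}$ is the normalized discounted occupancy measure. The helper `evaluate` and the random instance are hypothetical.

```python
import numpy as np

def evaluate(pi, P, c, mu, gamma):
    """Return (J, Q, d) for policy pi[s, a], kernel P[s, a, s'], cost c[s, a]."""
    S = c.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    c_pi = np.sum(pi * c, axis=1)           # expected one-step cost under pi
    M = np.eye(S) - gamma * P_pi
    V = np.linalg.solve(M, c_pi)            # value function
    Q = c + gamma * P @ V                   # action-value function Q[s, a]
    d = (1 - gamma) * np.linalg.solve(M.T, mu)  # normalized occupancy (assumed)
    return float(mu @ V), Q, d

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((S, A))
pi = rng.random((S, A))
pi /= pi.sum(axis=1, keepdims=True)
mu = np.full(S, 1.0 / S)

H = 1.0 / (1.0 - gamma)                     # effective horizon (assumed)
J, Q, d = evaluate(pi, P, c, mu, gamma)
grad = H * d[:, None] * Q                   # closed form from Lemma 1

# Central finite differences over each entry pi[s, a], treated as a free variable.
eps = 1e-6
fd = np.zeros_like(pi)
for s in range(S):
    for a in range(A):
        up, down = pi.copy(), pi.copy()
        up[s, a] += eps
        down[s, a] -= eps
        J_up = evaluate(up, P, c, mu, gamma)[0]
        J_down = evaluate(down, P, c, mu, gamma)[0]
        fd[s, a] = (J_up - J_down) / (2 * eps)

max_err = float(np.max(np.abs(grad - fd)))
print(max_err)
```

The finite-difference step perturbs single entries of `pi` without renormalizing, which matches taking the gradient with respect to the direct parametrization rather than along the simplex.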

Figures (3)

  • Figure 1: (a): An RCMDP example illustrating the gradient conflict challenge. Action labels are omitted when transitions are action-independent. (b): Policy gradients in the example with $(\gamma, \delta, b_1)=(0.4, 0.09, 0)$. Arrows represent the gradient to decrease $L_{1}(\pi)$. ${\color{red}\pi_2}$ attracts policy gradients but is a local minimum since $L_{1}({\color{red} \pi_2}) > L_{1}({\color{blue} \pi_1})$, where ${\color{blue} \pi_1}(\cdot, {\color{blue} a_1})=1$ and ${\color{red} \pi_2}(\cdot, {\color{red} a_2})=1$.
  • Figure 2: Algorithmic idea to find $b_0 = J^\star$ in Example 1 with $(\gamma, \delta, b_1) = (0.1, 0, 2/3)$.
  • Figure 3: Comparison of the algorithms in different settings (a), (b), and (c), defined in the experiments section. The feasible $\pi^{(k)}$ with the smallest return is marked; if none is feasible, the one with the smallest violation is marked. In all the settings, the double-loop algorithm (EpiRC-PGS) quickly identifies a feasible and low-return policy. Top row: Constraint violation ($y$-axis: $J_{c_1, \mathcal{U}}(\pi^{(k)}) - b_1$). Policies in the blue area satisfy the constraints. Middle row: Objective return relative to the uniform policy ($y$-axis: $J_{c_0, \mathcal{U}}(\pi^{(k)}) - J_{c_0, \mathcal{U}}(\pi_{\mathrm{unif}})$). Negative values indicate that the policies achieve non-trivially low cumulative objective costs. Bottom row: Constraint violation vs. relative objective return.
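
The marking rule in the Figure 3 caption (smallest return among feasible iterates, otherwise smallest violation) can be written as a short helper. This is a sketch with a hypothetical function name and made-up inputs, assuming a single constraint with threshold $b_1$ and that feasibility means a non-positive violation.

```python
import numpy as np

def marked_index(objective, constraint, b1):
    """Pick the iterate to mark, following the rule in the Figure 3 caption."""
    objective = np.asarray(objective, dtype=float)
    violation = np.asarray(constraint, dtype=float) - b1
    feasible = violation <= 0.0
    if feasible.any():
        idx = np.flatnonzero(feasible)
        return int(idx[np.argmin(objective[idx])])  # cheapest feasible iterate
    return int(np.argmin(violation))                # least-violating iterate

print(marked_index([3.0, 1.0, 2.0], [0.2, 0.9, 0.4], b1=0.5))
print(marked_index([3.0, 1.0], [0.9, 0.7], b1=0.5))
```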

Theorems & Definitions (58)

  • Lemma 1: Policy gradient theorem
  • Definition 1
  • Theorem 1
  • Example 1
  • Theorem 2
  • Lemma 2
  • Theorem 3
  • Lemma 3
  • Remark 1: Comparison to Lagrange
  • Theorem 4: Optimality of stationary points
  • ...and 48 more