Robust Lagrangian and Adversarial Policy Gradient for Robust Constrained Markov Decision Processes

David M. Bossens

TL;DR

This work tackles robust constrained MDPs (RCMDPs) by addressing two limitations of the baseline Robust Constrained Policy Gradient (RCPG): it robustifies only part of the constrained objective, and its worst-case dynamics are updated abruptly rather than incrementally. It introduces two algorithms: RCPG with Robust Lagrangian, which computes the worst-case dynamics within an L1 uncertainty set based on the Lagrangian, and Adversarial RCPG, which incrementally trains an adversarial policy to minimise the Lagrangian within the same budget. The author derives a Lagrangian policy gradient theorem for the policy optimisation of both algorithms and an adversarial policy gradient for the adversary. Across inventory management and safe navigation tasks with perturbed dynamics, both algorithms are competitive with traditional RCPG variants, and Adversarial RCPG ranks among the top two algorithms on all tests. These methods support gradual, theory-backed adaptation to uncertain dynamics while maintaining constraint satisfaction, which is relevant to safety-critical control and planning under model misspecification.

Abstract

The robust constrained Markov decision process (RCMDP) is a recent task-modelling framework for reinforcement learning that incorporates behavioural constraints and that provides robustness to errors in the transition dynamics model through the use of an uncertainty set. Simulating RCMDPs requires computing the worst-case dynamics based on value estimates for each state, an approach which has previously been used in the Robust Constrained Policy Gradient (RCPG). Highlighting potential downsides of RCPG such as not robustifying the full constrained objective and the lack of incremental learning, this paper introduces two algorithms, called RCPG with Robust Lagrangian and Adversarial RCPG. RCPG with Robust Lagrangian modifies RCPG by taking the worst-case dynamics based on the Lagrangian rather than either the value or the constraint. Adversarial RCPG also formulates the worst-case dynamics based on the Lagrangian but learns this directly and incrementally as an adversarial policy through gradient descent rather than indirectly and abruptly through constrained optimisation on a sorted value list. A theoretical analysis first derives the Lagrangian policy gradient for the policy optimisation of both proposed algorithms and then the adversarial policy gradient to learn the adversary for Adversarial RCPG. Empirical experiments injecting perturbations in inventory management and safe navigation tasks demonstrate the competitive performance of both algorithms compared to traditional RCPG variants as well as non-robust and non-constrained ablations. In particular, Adversarial RCPG ranks among the top two performing algorithms on all tests.
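The "constrained optimisation on a sorted value list" mentioned in the abstract can be illustrated with the standard greedy solution for an L1 uncertainty set. The following is a minimal sketch under stated assumptions, not the paper's implementation: the function name `worst_case_l1`, the example numbers, and the state-value form of the Lagrangian (`v - lam * c`) are all hypothetical choices made for illustration. For one state-action pair, the sketch shifts probability mass from the highest-valued next states to the lowest-valued one until the L1 budget is spent.

```python
import numpy as np

def worst_case_l1(p_nominal, values, budget):
    """Minimise p @ values over {p in simplex : ||p - p_nominal||_1 <= budget}.

    Hypothetical helper for illustration; greedy solution on a sorted value list.
    """
    p = np.asarray(p_nominal, dtype=float).copy()
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)   # ascending: order[0] is the lowest-valued next state
    worst = order[0]
    # Adding mass to one state and removing it elsewhere costs twice the shifted
    # mass in L1 distance, so at most budget / 2 can be moved.
    shift = min(budget / 2.0, 1.0 - p[worst])
    p[worst] += shift
    # Remove the same amount of mass, starting from the highest-valued states.
    for i in order[::-1]:
        if shift <= 0.0:
            break
        if i == worst:
            continue
        removed = min(shift, p[i])
        p[i] -= removed
        shift -= removed
    return p

# Assumed example: value and constraint-cost estimates for three next states.
v = np.array([1.0, 0.5, 2.0])      # value estimates (assumed numbers)
c = np.array([0.0, 1.0, 0.0])      # constraint-cost estimates (assumed numbers)
lam = 0.5                          # Lagrange multiplier (assumed)
lagrangian_values = v - lam * c    # assumed state-value analogue of Q - lambda * C
p_hat = np.array([0.2, 0.5, 0.3])  # nominal transition probabilities (assumed)
p_worst = worst_case_l1(p_hat, lagrangian_values, budget=0.4)
```

Passing `v` alone as `values` would give worst-case dynamics based only on the value, whereas passing the Lagrangian values makes the adversarial shift account for both the return and the constraint cost, which is the distinction the abstract draws between RCPG and RCPG with Robust Lagrangian.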

Paper Structure (22 sections, 2 theorems, 15 equations, 7 figures, 2 tables, 2 algorithms)

Key Result

Theorem 1

Lagrangian policy gradient theorem. Let $\pi: \mathcal{S} \to \Delta^A$ be a stochastic policy, let $P$ be the transition dynamics, let $s_0$ be the starting state, and for any state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$ define $\mathbf{Q}_{\pi}(s,a) = Q_{\pi}(s,a) - \lambda C_{\pi} \dots$

Figures (7)

  • Figure 1: Test performance metrics of the algorithms on the test set of perturbed transition dynamics in Inventory Management. For each of 20 training runs, each parameter setting is run 50 times and the plot displays the mean and standard error over runs. The parameters manipulated are the mean, $\mu$, and the standard deviation, $\sigma$, of the demand distribution.
  • Figure 2: Illustration of the safe navigation tasks. In Safe Navigation 1, the constraint is to hit no more than 4 grey cells on average. In Safe Navigation 2, the constraint is to avoid the red cells and to hit only a limited number of grey cells. Unconstrained, constrained, and robust-constrained trajectories are shown in red, orange, and green, respectively. The arrows represent the worst-case transitions for test Safe Navigation 2B.
  • Figure 3: Test performance metrics of the algorithms on the test set of perturbed transition dynamics in Safe Navigation 1. For each of 20 training runs, each parameter setting is run 50 times and the plot displays the mean and standard error over runs. Test A: The parameter manipulated is the move probability of the actions. Test B: The parameter manipulated is the number of perturbations, i.e. randomly selected state-action pairs that are perturbed with a random offset in $\mathcal{N}(s)$.
  • Figure 4: Test performance metrics of the algorithms on the test set of perturbed transition dynamics in Safe Navigation 2. For each of 20 training runs, each parameter setting is run 50 times and the plot displays the mean and standard error over runs. Test A: The parameter manipulated is the move probability of the actions. Test B: The parameter manipulated is the number of perturbations, i.e. randomly selected states that are perturbed with a worst-case transition according to the arrows in Fig. 2b.
  • Figure 5: Training performance metrics of the algorithms over 5,000 episodes on Inventory Management. Note that the training performance corresponds to the performance on the simulated transition dynamics, which is defined differently for the different algorithms.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2