Solving Richly Constrained Reinforcement Learning through State Augmentation and Reward Penalties

Hao Jiang; Tien Mai; Pradeep Varakantham; Minh Huy Hoang

TL;DR

This work addresses constrained reinforcement learning by augmenting the state with the accumulated cost and penalizing the reward when constraints are violated, which turns the constrained optimization into an unconstrained problem on an extended MDP. Depending on the choice of the penalty parameter $\lambda$, the framework can represent risk-neutral, VaR, CVaR, and worst-case CMDPs, and it provides theoretical bounds that guarantee feasibility for these variants. The authors adapt DQN and SAC to the extended state, yielding Safe DQN and Safe SAC, which assign credit for cost violations and enforce constraints while maintaining performance. Experiments on RN-CMDP and CVaR-CMDP benchmarks, together with ablations and penalty-tuning studies, show that these methods often outperform leading constrained-RL approaches and match or exceed CVaR-focused baselines. The approach offers a unified, scalable paradigm for richly constrained RL with practical relevance to safety-critical domains, though the reward penalty parameter must be tuned carefully for each environment.

Abstract

Constrained Reinforcement Learning has been employed to enforce safety constraints on a policy through the use of expected cost constraints. The key challenge is in handling the expected cost accumulated under the policy, and not just the cost of a single step. Existing methods have developed innovative ways of converting this cost constraint over the entire policy into constraints over local decisions (at each time step). While such approaches have provided good solutions with regard to the objective, they can be either overly aggressive or overly conservative with respect to costs. This is owing to the use of estimates for "future" or "backward" costs in the local cost constraints. To that end, we provide an equivalent unconstrained formulation of constrained RL that has an augmented state space and reward penalties. This intuitive formulation is general and has interesting theoretical properties. More importantly, it provides a new paradigm for solving constrained RL problems effectively. As we show in our experimental results, we are able to outperform leading approaches on multiple benchmark problems from the literature.
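
The idea of an augmented state with reward penalties can be illustrated with a small environment wrapper. The following is a minimal sketch and not the authors' implementation: the toy corridor environment, the names `ToyCostEnv`, `CostAugmentedEnv`, `cost_budget`, `penalty`, and `rollout`, and the specific rule of subtracting the penalty on every step taken while over budget are all assumptions made for illustration. Discounting is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class ToyCostEnv:
    """Hypothetical 1-D corridor: reach the last cell; one cell incurs a cost."""
    length: int = 5
    hazard: int = 2
    pos: int = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action: 0 = stay, 1 = move one cell to the right
        self.pos = min(self.pos + action, self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0
        cost = 1.0 if self.pos == self.hazard else 0.0
        return self.pos, reward, cost, done


class CostAugmentedEnv:
    """Wraps an env so the state carries accumulated cost and the reward is penalized."""

    def __init__(self, env, cost_budget, penalty):
        self.env = env
        self.cost_budget = cost_budget  # assumed threshold on accumulated cost
        self.penalty = penalty          # plays the role of lambda
        self.acc_cost = 0.0

    def reset(self):
        self.acc_cost = 0.0
        return (self.env.reset(), self.acc_cost)

    def step(self, action):
        state, reward, cost, done = self.env.step(action)
        self.acc_cost += cost
        # Assumed penalty rule: subtract lambda on each step taken over budget.
        if self.acc_cost > self.cost_budget:
            reward -= self.penalty
        # The agent observes the pair (original state, accumulated cost).
        return (state, self.acc_cost), reward, done


def rollout(env, actions):
    """Run a fixed action sequence; return the final augmented state and total reward."""
    state = env.reset()
    total = 0.0
    for action in actions:
        state, reward, done = env.step(action)
        total += reward
        if done:
            break
    return state, total
```

Because the accumulated cost is part of the observation, a standard unconstrained learner such as DQN or SAC can in principle be trained on the wrapped environment without a separate constraint-handling mechanism; how the penalty is actually defined and tuned is specified in the paper, not in this sketch.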

Paper Structure (33 sections, 13 theorems, 72 equations, 13 figures, 2 algorithms)

Introduction
Constrained Markov Decision Process
Cost Augmented Formulation for Safe RL
Extended MDP Reformulation
Theoretical Properties
Safe RL Algorithms
Safe DQN
Safe SAC
Experimental Results
RN-CMDP
CVaR-CMDP
Ablation Analysis
Impact of Reward Penalty
Conclusion
Safe DQN Pseudocode
...and 18 more sections

Key Result

Proposition 3.1

If $\lambda = 0$, then the unconstrained extended-MDP formulation (equ:umdp) is equivalent to the unconstrained MDP $\max_\pi \mathbb{E}\left[\sum_{t=0}^T \gamma^t r(s_t,a_t) \,\middle|\, s_0,\pi\right]$.
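
The reasoning behind this proposition can be sketched under an assumed form of the penalized objective. The exact formulation is not reproduced on this page, so the indicator-style penalty, the accumulated cost $c_t$, and the threshold $c_{\max}$ below are illustrative assumptions rather than the paper's definitions.

```latex
% Assumed penalized objective on the augmented state (s_t, c_t):
\max_\pi \; \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^t \Big( r(s_t,a_t) - \lambda \, \mathbb{1}\{ c_t > c_{\max} \} \Big) \,\middle|\, s_0, \pi \right]

% Setting \lambda = 0 removes the penalty term:
\max_\pi \; \mathbb{E}\left[ \sum_{t=0}^{T} \gamma^t r(s_t,a_t) \,\middle|\, s_0, \pi \right]
```

Under this assumption the remaining reward does not depend on $c_t$, so the augmented component of the state has no effect on the objective and the problem reduces to the original unconstrained MDP.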

Figures (13)

Figure 1: Gridworld environment and reward, cost comparison of different approaches.
Figure 2: Highway environment and reward, cost comparison of different approaches
Figure 3: Experiment with CVaR Constraint in Merge Environment
Figure 4: Ablation Analysis with GridWorld
Figure 5: Experiment in GridWorld with Different Reward Penalties
...and 8 more figures

Theorems & Definitions (25)

Proposition 3.1
Theorem 3.2: Connection to worst-case CMDP
Lemma 3.3
Lemma 3.4
Theorem 3.5: Connection to the risk-neutral CMDP
Theorem 3.6: Connection to VaR CMDP
Theorem 3.7: VaR equivalence
Theorem 3.8: CVaR CMDP equivalence
proof
proof
...and 15 more
