Constraint-Aware Reinforcement Learning via Adaptive Action Scaling

Murad Dawood, Usama Ahmed Siddiquie, Shahram Khorshidi, Maren Bennewitz

TL;DR

This work tackles safe RL under a hard-safety regime by decoupling reward optimization from safety enforcement via a cost-aware regulator that scales actions rather than overriding them. The regulator uses online cost estimates from twin critics to produce a per-dimension scaling $\tilde{a}_t = \rho_\theta(s_t,a_t,\hat{c}_t) \odot a_t$, preserving exploration while reducing violations. It integrates with off-policy actors like SAC and TD3 and introduces a regulator loss $\mathcal{L}_{reg}$ to balance cost reduction with action retention, ensuring stable learning. Empirically, the method achieves state-of-the-art return-to-cost ratios on Safety Gym locomotion tasks and safety-critical systems, reducing constraint violations by up to $\sim126\times$ and demonstrating robustness to noise with promising sim-to-real transfer potential.
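The scaling rule $\tilde{a}_t = \rho_\theta(s_t,a_t,\hat{c}_t) \odot a_t$ can be illustrated with a minimal Python sketch. Everything except the element-wise product is an assumption made for illustration: `toy_regulator` is a hand-written stand-in for the learned network $\rho_\theta$, `sensitivity` is a hypothetical parameter, and combining the twin critics with a conservative `max` is one plausible choice that the summary does not specify.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
CostCritic = Callable[[Vector, Vector], float]


def estimate_cost(s: Vector, a: Vector, critic_1: CostCritic, critic_2: CostCritic) -> float:
    """Combine twin cost critics into one estimate c_hat.

    Assumption: take the larger (more pessimistic) of the two predictions.
    """
    return max(critic_1(s, a), critic_2(s, a))


def toy_regulator(s: Vector, a: Vector, c_hat: float, sensitivity: float = 2.0) -> List[float]:
    """Hypothetical stand-in for rho_theta(s, a, c_hat).

    Returns one factor in (0, 1] per action dimension. Factors shrink as the
    predicted cost grows, and shrink faster for large-magnitude dimensions.
    The state s is unused in this toy version; a learned regulator would use it.
    """
    cost = max(c_hat, 0.0)
    return [math.exp(-sensitivity * cost * abs(a_i)) for a_i in a]


def scale_action(a: Vector, rho: Vector) -> List[float]:
    """Element-wise product rho ⊙ a: the action is attenuated, never replaced."""
    return [r * a_i for r, a_i in zip(rho, a)]


# Example with constant dummy critics (illustrative values only).
state = [0.0]
action = [1.0, -0.5]
c_hat = estimate_cost(state, action, lambda s, a: 0.2, lambda s, a: 0.5)
rho = toy_regulator(state, action, c_hat)
safe_action = scale_action(action, rho)
```

Because each factor lies in $(0, 1]$, the scaled action keeps the sign of the proposed action in every dimension and only reduces its magnitude. When the predicted cost is zero, the factors equal 1 and the agent's action passes through unchanged.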

Abstract

Safe reinforcement learning (RL) seeks to mitigate unsafe behaviors that arise from exploration during training by reducing constraint violations while maintaining task performance. Existing approaches typically rely on a single policy to jointly optimize reward and safety, which can cause instability due to conflicting objectives, or they use external safety filters that override actions and require prior system knowledge. In this paper, we propose a modular cost-aware regulator that scales the agent's actions based on predicted constraint violations, preserving exploration through smooth action modulation rather than overriding the policy. The regulator is trained to minimize constraint violations while avoiding degenerate suppression of actions. Our approach integrates seamlessly with off-policy RL methods such as SAC and TD3, and achieves state-of-the-art return-to-cost ratios on Safety Gym locomotion tasks with sparse costs, reducing constraint violations by up to 126 times while increasing returns by over an order of magnitude compared to prior methods.

Paper Structure

This paper contains 15 sections, 10 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of cost-aware action scaling. The RL agent proposes an action that would result in the center of mass (COM) exceeding the velocity threshold (left). The regulator (blue) intervenes by scaling the action, keeping the velocity of the COM within the safe zone while allowing progress on the task. The yellow circles highlight the velocity threshold for the COM, illustrating how the regulator enforces a safety constraint while preserving task performance.
  • Figure 2: Overview of our modular safe RL architecture. The regulator (blue) scales actions produced by the unconstrained RL agent (yellow) based on predicted cost (purple), producing safety-aware actions (green) that are executed in the environment.
  • Figure 3: Performance comparison on the Safety Gymnasium locomotion environments. Each method is averaged over three independent runs; bold lines indicate the mean, and shaded areas show the standard deviation. Our methods (SAC-REG and TD3-REG) consistently achieve the best trade-off between return and cumulative constraint cost across all environments. Top: Episode return. Middle: Cumulative safety cost. Bottom: Return-to-Log-Cost ratio. Our methods outperform strong baselines, including PPO-PID [stooke2020responsive], SIMMER [sootla2022enhancing], SRCPO [kim2024spectral], RESPO [ganai2023iterative], and SEDITOR [yu2022towards]. SIMMER is omitted from the Swimmer plot as it consistently yields negative returns, moving opposite to the target velocity.
  • Figure 4: Performance comparison on the BiGlucose and CSTR environments. Our method (SAC-Reg) achieves high return with low cost, yielding the best return-to-cost ratio throughout training.
  • Figure 5: Ablation study evaluating the impact of the regulator hyperparameters $\lambda$ and $\beta$ on return, cumulative cost, and return-to-cost ratio. Each curve shows the mean across three runs, and shaded regions indicate the standard deviation. The top row varies $\lambda$ with fixed $\beta = 10$; the bottom row varies $\beta$ with fixed $\lambda = 0.0015$. Smaller $\lambda$ values reduce cumulative cost, with $\lambda = 0.0015$ giving the best balance between performance and safety, while $\beta = 10$ provides the most favorable trade-off overall.
  • ...and 1 more figure