Leveraging Constraint Violation Signals For Action-Constrained Reinforcement Learning

Janaka Chathuranga Brahmanage, Jiajing Ling, Akshat Kumar

TL;DR

The paper tackles safe action-constrained reinforcement learning (ACRL) by addressing the limitations of projection-based methods through Constraint Violation-Flows (CV-Flows), which learn a state-conditioned invertible mapping from latent actions to feasible actions using a constraint-violation-derived target density $p(a|s)\propto e^{-\lambda\mathrm{CV}(a,s)}$. By training the flow with reverse KL against this target and integrating it with Soft Actor-Critic (SAC) on latent actions, the authors obtain an accurate, differentiable mapping to the feasible action space while avoiding backpropagation through the flow. The approach extends to state-wise constraints by learning the CV signal from environment interactions. Empirically, it shows significantly fewer constraint violations (often by more than 10x) with competitive or superior returns across eight MuJoCo tasks and multiple state-wise benchmarks, with favorable runtimes in non-convex settings. These results indicate that CV-Flows offer a scalable, safer alternative for ACRL in continuous control, with practical implications for robotics and resource-management domains where strict adherence to constraints is essential.
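To make the target density concrete, here is a minimal sketch of a constraint-violation signal and the corresponding unnormalized log-density $-\lambda\,\mathrm{CV}(a,s)$. The L2-norm constraint $\|a\|_2 \le c$, the function names, and the values of `lam` and `c` are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def constraint_violation(a, c=1.0):
    """Hypothetical CV signal for the toy constraint ||a||_2 <= c.

    Returns 0 for feasible actions and the distance past the boundary otherwise.
    """
    return np.maximum(0.0, np.linalg.norm(a, axis=-1) - c)

def target_log_density_unnorm(a, lam=10.0, c=1.0):
    """Unnormalized log p(a|s) = -lambda * CV(a, s).

    The state is ignored here because the toy constraint does not depend on it.
    """
    return -lam * constraint_violation(a, c)

actions = np.array([[0.3, 0.4],   # norm 0.5, feasible
                    [3.0, 4.0]])  # norm 5.0, infeasible
cv = constraint_violation(actions)
logp = target_log_density_unnorm(actions)
```

Under this toy constraint the unnormalized log-density is constant wherever the violation is zero and decreases linearly with the violation outside the feasible set, with $\lambda$ controlling how sharply infeasible actions are penalized.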

Abstract

In many RL applications, ensuring an agent's actions adhere to constraints is crucial for safety. Most previous methods in Action-Constrained Reinforcement Learning (ACRL) employ a projection layer after the policy network to correct the action. However, projection-based methods suffer from issues such as the zero gradient problem and higher runtime due to the use of optimization solvers. Recently, methods were proposed that train generative models to learn a differentiable mapping between latent variables and feasible actions to address this issue. However, generative models require training on samples from the constrained action space, and obtaining such samples is itself challenging. To address these limitations, first, we define a target distribution for feasible actions based on constraint violation signals, and train normalizing flows by minimizing the KL divergence between an approximated distribution over feasible actions and the target. This eliminates the need to generate feasible action samples, greatly simplifying the flow model learning. Second, we integrate the learned flow model with existing deep RL methods, which restricts them to exploring only the feasible action space. Third, we extend our approach beyond ACRL to handle state-wise constraints by learning the constraint violation signal from the environment. Empirically, our approach has significantly fewer constraint violations while achieving similar or better quality in several control tasks than previous best methods.
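The claim that no feasible action samples are needed can be seen from the form of the KL objective. The following is a sketch, assuming the KL is taken from the flow distribution $q_\theta$ to the target (the reverse direction) and that the target is $p(a|s) = e^{-\lambda\mathrm{CV}(a,s)}/Z(s)$. The symbols $f_\theta$, $p_Z$, $q_\theta$, and $Z(s)$ are notation introduced here for illustration and may differ from the paper's.

```latex
\begin{aligned}
a &= f_\theta(z; s), \qquad z \sim p_Z(z), \\
\log q_\theta(a|s) &= \log p_Z(z) - \log\left|\det \frac{\partial f_\theta(z; s)}{\partial z}\right|, \\
\mathrm{KL}\big(q_\theta \,\|\, p\big)
  &= \mathbb{E}_{z \sim p_Z}\Big[\log q_\theta(a|s) + \lambda\,\mathrm{CV}(a, s)\Big] + \log Z(s).
\end{aligned}
```

The normalizer $\log Z(s)$ does not depend on $\theta$, so it can be dropped during optimization. The expectation is over base samples $z$, so the objective requires only evaluations of $\mathrm{CV}$ at actions produced by the flow, not samples from the feasible set.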

Paper Structure

This paper contains 22 sections, 1 theorem, 22 equations, 9 figures, 6 tables, and 3 algorithms.

Key Result

Proposition 1

The log-probability of the combined policy, $\log \pi(a|s)$, can be approximated using $\hat{a}$ as:

Figures (9)

  • Figure 1: Two approaches to integrate action constraints with RL: (a) Mapping-based approach and (b) Projection-based approach.
  • Figure 2: Flow Model integration with the SAC Policy: $\mu$ represents the original policy network, $\hat{a}$ is the latent action, and $f$ is the mapping function that maps the latent action into a feasible environment action $a$. $\pi$ represents the combined policy. The latent action $\hat{a}$ is stored in the replay buffer to train both the $\mu$ and critic networks.
  • Figure 3: Evaluation returns for eight MuJoCo continuous control tasks during training. A higher return is better.
  • Figure 4: Evaluation returns for four state-constrained tasks during training. A higher return ($\uparrow$) is better.
  • Figure 5: Average timesteps per second of the RL agent for tasks with non-convex constraints (higher is better, $\uparrow$). The CVFlow-based approach has a significantly higher frame rate.
  • ...and 4 more figures
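
The data flow described in the Figure 2 caption can be sketched as follows. This is a toy illustration only: `latent_policy` stands in for the policy network $\mu$, `feasible_map` is a hand-written radial squash standing in for the learned flow $f$, and `toy_env_step` is a hypothetical environment. None of these names or definitions come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_policy(state):
    """Stand-in for the policy network mu: samples a latent action a_hat."""
    return state[:2] + rng.normal(size=2)

def feasible_map(a_hat, state, c=1.0):
    """Stand-in for the learned flow f (assumption, not the paper's model).

    Radially squashes a_hat into the open ball ||a||_2 < c; the state is unused here.
    """
    return c * a_hat / (1.0 + np.linalg.norm(a_hat))

def toy_env_step(state, action):
    """Hypothetical environment: the action shifts the state; reward is -||next_state||."""
    next_state = state + 0.1 * np.concatenate([action, action])
    return next_state, -float(np.linalg.norm(next_state))

replay_buffer = []
state = np.zeros(4)
for _ in range(5):
    a_hat = latent_policy(state)          # latent action from mu
    a = feasible_map(a_hat, state)        # feasible environment action
    next_state, reward = toy_env_step(state, a)
    # Store the latent action a_hat (not a), as in the Figure 2 caption.
    replay_buffer.append((state, a_hat, reward, next_state))
    state = next_state
```

Because the buffer holds $\hat{a}$ rather than $a$, the policy and critic updates in this sketch would operate entirely in the latent action space, while the environment only ever receives mapped actions.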

Theorems & Definitions (2)

  • Proposition 1
  • proof