FlowPG: Action-constrained Policy Gradient with Normalizing Flows

Janaka Chathuranga Brahmanage, Jiajing Ling, Akshat Kumar

TL;DR

FlowPG tackles action-constrained RL in continuous action spaces by using a normalizing flow to learn an invertible mapping from a latent uniform distribution to the feasible action space. A conditional RealNVP maps latent samples to valid actions conditioned on the state. It is trained by maximum likelihood on a dataset of feasible actions, which are sampled from the feasible set with Hamiltonian Monte Carlo (HMC) or probabilistic sentential decision diagrams (PSDDs). The learned flow lets gradients propagate end to end through DDPG, removing the projection layer and its zero-gradient problem. Across multiple continuous-control and resource-allocation tasks, it yields substantially fewer constraint violations, competitive or better returns, and faster training. Because the flow provides tractable action densities, the approach extends to other RL algorithms, making it a practical and scalable option for safety-critical decision making in constrained environments.
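The invertibility that the summary relies on comes from the affine coupling layers that RealNVP is built from. The following is a minimal NumPy sketch of one state-conditioned coupling layer, not the authors' implementation: the conditioner is a tiny MLP with random, untrained weights, and the names `make_conditioner`, `coupling_forward`, and `coupling_inverse`, as well as the 2-D action and 3-D state sizes, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_conditioner(in_dim, out_dim, hidden=16):
    """Hypothetical stand-in for a trained conditioner network (random weights)."""
    W1 = rng.normal(scale=0.5, size=(in_dim, hidden))
    W2 = rng.normal(scale=0.5, size=(hidden, 2 * out_dim))

    def net(x):
        out = np.tanh(x @ W1) @ W2
        log_scale, shift = out[..., :out_dim], out[..., out_dim:]
        return np.tanh(log_scale), shift  # bounded log-scale for numerical stability

    return net

def coupling_forward(z, state, net):
    """Latent -> action: keep z1 fixed, affinely transform z2 given (z1, state)."""
    z1, z2 = z[..., :1], z[..., 1:]
    log_s, t = net(np.concatenate([z1, state], axis=-1))
    a2 = z2 * np.exp(log_s) + t
    log_det = log_s.sum(axis=-1)  # log |det Jacobian|, used in the likelihood
    return np.concatenate([z1, a2], axis=-1), log_det

def coupling_inverse(a, state, net):
    """Action -> latent: exact inverse, since a1 == z1 reproduces log_s and t."""
    a1, a2 = a[..., :1], a[..., 1:]
    log_s, t = net(np.concatenate([a1, state], axis=-1))
    z2 = (a2 - t) * np.exp(-log_s)
    return np.concatenate([a1, z2], axis=-1)

# Assumed sizes: 2-D latent/action, 3-D state, batch of 5.
net = make_conditioner(in_dim=1 + 3, out_dim=1)
z = rng.uniform(-1.0, 1.0, size=(5, 2))
state = rng.normal(size=(5, 3))

a, log_det = coupling_forward(z, state, net)
z_rec = coupling_inverse(a, state, net)
```

A full flow would stack several such layers, alternating which coordinate is held fixed, and would fit the conditioner weights by maximizing the likelihood of the sampled feasible actions. The sketch only shows why the map is invertible and why its log-determinant is cheap to compute.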

Abstract

Action-constrained reinforcement learning (ACRL) is a popular approach for solving safety-critical and resource-allocation decision-making problems. A major challenge in ACRL is ensuring that the agent takes a valid action satisfying the constraints at each RL step. The commonly used approach of placing a projection layer on top of the policy network requires solving an optimization program, which can result in longer training time, slow convergence, and the zero-gradient problem. To address this, we first use a normalizing flow model to learn an invertible, differentiable mapping between the feasible action space and the support of a simple distribution on a latent variable, such as a Gaussian. Second, learning the flow model requires sampling from the feasible action space, which is itself challenging. We develop multiple methods, based on Hamiltonian Monte Carlo and probabilistic sentential decision diagrams, for such action sampling under convex and non-convex constraints. Third, we integrate the learned normalizing flow with the DDPG algorithm. By design, a well-trained normalizing flow transforms the policy output into a valid action without requiring an optimization solver. Empirically, our approach results in significantly fewer constraint violations (up to an order of magnitude for several instances) and is multiple times faster on a variety of continuous control tasks.
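The zero-gradient problem mentioned in the abstract can be seen in a small numerical example. The sketch below is an illustration under stated assumptions, not the paper's projection layer: it uses the Reacher constraint $a_1^2 + a_2^2 \leq 0.05$ quoted in the Figure 3 caption, for which the Euclidean projection has a closed form, and the helper names `project_to_ball` and `numerical_jacobian` are hypothetical. For an infeasible policy output, the Jacobian of the projection vanishes along the radial direction, so no gradient signal flows back along it.

```python
import numpy as np

RADIUS = np.sqrt(0.05)  # feasible set: a_1^2 + a_2^2 <= 0.05

def project_to_ball(a, radius=RADIUS):
    """Euclidean projection onto the L2 ball (closed form, no solver needed here)."""
    norm = np.linalg.norm(a)
    return a if norm <= radius else a * (radius / norm)

def numerical_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian of f at x; column i is d f / d x_i."""
    x = np.asarray(x, dtype=float)
    cols = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        cols.append((f(x + step) - f(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)

raw_action = np.array([0.6, 0.8])           # infeasible policy output, norm 1.0
J_outside = numerical_jacobian(project_to_ball, raw_action)

radial = raw_action / np.linalg.norm(raw_action)
tangent = np.array([-radial[1], radial[0]])

radial_grad = J_outside @ radial            # ~0: moving outward changes nothing
tangent_grad = J_outside @ tangent          # nonzero: sliding along the boundary

inside_action = np.array([0.1, 0.1])        # feasible, projection is the identity
J_inside = numerical_jacobian(project_to_ball, inside_action)
```

With general convex constraints the projection has no closed form and is typically computed by an optimization solver at every step, which is the training-time cost the abstract refers to.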

Paper Structure

This paper contains 20 sections, 13 equations, 8 figures, 1 table, and 1 algorithm.

Figures (8)

  • Figure 1: (a) An SDD representing the PB constraint $A\cdot2^1 + B\cdot2^0 + C\cdot2^1 + D\cdot2^0 \leq 2$; (b) a PSDD; (c) samples generated with HMC
  • Figure 2: (a) Policy network; (b) The reversed gradient path of $\theta$ (in blue). Nodes denote variables and edges denote operations. Paths in black are detached for $\theta$. The green block is a negative loss (assuming a minimization task).
  • Figure 3: Mapping between a uniform distribution and action space of Reacher with constraint $a_1^2 + a_2^2 \leq 0.05$
  • Figure 4: Training curves for the Reacher, Half Cheetah, and BSS environments are displayed in columns from left to right, showcasing the Average Return ($\uparrow$), Cumulative Constraint Violations ($\downarrow$), Average Magnitude of Constraint Violations ($\downarrow$), and Time Elapsed ($\downarrow$).
  • Figure 5: Density map of generated valid actions using HMC and Rejection Sampling methods.
  • ...and 3 more figures