KFCPO: Kronecker-Factored Approximated Constrained Policy Optimization

Joonyoung Lim, Younghwan Yoo

TL;DR

KFCPO tackles safe reinforcement learning under cumulative cost constraints by combining Kronecker-Factored Approximate Curvature (K-FAC) with a margin-aware gradient mechanism and minibatch KL rollback. The method computes separate K-FAC-based natural gradients for reward and cost and dynamically balances them according to the agent’s proximity to the safety threshold, using a direction-aware projection to avoid gradient interference. A minibatch KL rollback further enforces trust-region-like stability during on-policy updates. Empirical results on Safety Gymnasium via OmniSafe show that KFCPO achieves 10.3%–50.2% higher average returns than the best cost-respecting baselines while maintaining safer behavior across diverse tasks, even with high-dimensional observations. These findings indicate that combining scalable second-order optimization with adaptive safety-aware gradient blending yields robust and safe performance improvements in constrained Markov decision process (CMDP) settings.
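To make the blending idea concrete, here is a minimal NumPy sketch of one way a margin-aware, direction-aware combination of reward and cost gradients could look. It is an illustration under stated assumptions, not the paper's formula: the sigmoid weighting, the `temperature` parameter, and the function names `blend_weights`, `remove_conflict`, and `blended_update` are all hypothetical. `g_r` is taken to be a reward-ascent direction and `g_c` a cost-ascent direction, so stepping along `-g_c` lowers the expected cost.

```python
import numpy as np


def blend_weights(c_ep, cost_limit, temperature=5.0):
    """Smooth weights from the safety margin (assumed sigmoid form).

    margin > 0 means the episode cost c_ep is below the limit, so the
    reward gradient gets more weight; margin < 0 shifts weight to cost.
    """
    margin = cost_limit - c_ep
    z = np.clip(margin / temperature, -50.0, 50.0)  # avoid overflow in exp
    w_reward = float(1.0 / (1.0 + np.exp(-z)))
    return w_reward, 1.0 - w_reward


def remove_conflict(g_r, g_c, eps=1e-12):
    """Drop the part of g_r that would increase cost, if there is one."""
    overlap = float(g_r @ g_c)
    if overlap <= 0.0:
        return g_r  # reward direction does not raise cost: keep as is
    return g_r - overlap / (float(g_c @ g_c) + eps) * g_c


def blended_update(g_r, g_c, c_ep, cost_limit, temperature=5.0):
    """Weighted sum of the projected reward direction and cost descent."""
    w_reward, w_cost = blend_weights(c_ep, cost_limit, temperature)
    return w_reward * remove_conflict(g_r, g_c) - w_cost * g_c


# Toy usage: the agent is over budget and the two directions conflict.
g_r = np.array([1.0, 1.0])
g_c = np.array([1.0, 0.0])
step = blended_update(g_r, g_c, c_ep=30.0, cost_limit=25.0)
```

Because the weights vary continuously with the margin, the update shifts gradually from reward-seeking to cost-reducing as `c_ep` approaches and crosses the limit, rather than switching at a hard threshold.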

Abstract

We propose KFCPO, a novel Safe Reinforcement Learning (Safe RL) algorithm that combines scalable Kronecker-Factored Approximate Curvature (K-FAC) based second-order policy optimization with safety-aware gradient manipulation. KFCPO leverages K-FAC to perform efficient and stable natural gradient updates by approximating the Fisher Information Matrix (FIM) in a layerwise, closed-form manner, avoiding the overhead of iterative approximation. To address the tradeoff between reward maximization and constraint satisfaction, we introduce a margin-aware gradient manipulation mechanism that adaptively adjusts the influence of reward and cost gradients based on the agent's proximity to safety boundaries. This method blends gradients using a direction-sensitive projection, eliminating harmful interference and avoiding abrupt changes caused by fixed hard thresholds. Additionally, a minibatch-level KL rollback strategy is adopted to ensure trust-region compliance and to prevent destabilizing policy shifts. Experiments on Safety Gymnasium using OmniSafe show that KFCPO achieves 10.3% to 50.2% higher average return across environments than the best baseline that respected the safety constraint, demonstrating a superior balance of safety and performance.
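The layerwise, closed-form Fisher approximation can be illustrated with the standard K-FAC identity for a single linear layer: if the layer's Fisher block is approximated as $A \otimes G$, then $(A \otimes G)^{-1}\,\mathrm{vec}(\nabla W) = \mathrm{vec}(G^{-1}\,\nabla W\,A^{-1})$, so only two small matrices need to be inverted. The NumPy sketch below shows this generic identity rather than KFCPO's implementation. The function name `kfac_precondition`, the `damping` value, and the choice to damp each factor separately are assumptions made for illustration.

```python
import numpy as np


def kfac_precondition(grad_W, A, G, damping=1e-2):
    """Approximate natural gradient for one linear layer.

    grad_W : (out, in) gradient of the loss w.r.t. the weight matrix
    A      : (in, in)  second moment of layer inputs, E[a a^T]
    G      : (out, out) second moment of output gradients, E[g g^T]

    Returns G_d^{-1} @ grad_W @ A_d^{-1}, where each factor is damped
    separately (an assumed, commonly used simplification).
    """
    A_d = A + damping * np.eye(A.shape[0])
    G_d = G + damping * np.eye(G.shape[0])
    left = np.linalg.solve(G_d, grad_W)         # G_d^{-1} grad_W
    return np.linalg.solve(A_d.T, left.T).T     # (...) A_d^{-1}


# Toy usage with random statistics for a layer with 3 inputs, 2 outputs.
rng = np.random.default_rng(0)
a = rng.normal(size=(64, 3))   # layer inputs
g = rng.normal(size=(64, 2))   # backpropagated output gradients
A = a.T @ a / len(a)
G = g.T @ g / len(g)
grad_W = rng.normal(size=(2, 3))
nat_grad = kfac_precondition(grad_W, A, G)
```

The two solves cost roughly $O(\text{in}^3 + \text{out}^3)$ per layer, whereas inverting the full $(\text{in}\cdot\text{out}) \times (\text{in}\cdot\text{out})$ Fisher block directly would be far more expensive. Avoiding an iterative solver such as conjugate gradient is the overhead saving the abstract refers to.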

Paper Structure

This paper contains 15 sections, 15 equations, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: Adaptive blending weights based on $c_{\text{ep}}$.
  • Figure 2: Blending behavior: (left) with projection, (right) direct combination.
  • Figure 3: Final updates under conflict show varying degrees of misalignment with $\tilde{g}_c$.
  • Figure 4: Visual appearances of agents: (left) Point, (right) Car agent.
  • Figure 5: Examples of task environments: (left) Goal, (right) Button task.
  • ...and 2 more figures