KFCPO: Kronecker-Factored Approximated Constrained Policy Optimization
Joonyoung Lim, Younghwan Yoo
TL;DR
KFCPO tackles safe reinforcement learning under cumulative cost constraints by combining Kronecker-Factored Approximate Curvature (K-FAC) with a margin-aware gradient mechanism and a minibatch KL rollback. The method computes separate K-FAC-based natural gradients for reward and cost and balances them dynamically according to the agent's proximity to the safety threshold, using a direction-aware projection to avoid gradient interference. The minibatch KL rollback further enforces trust-region-like stability during on-policy updates. Empirical results on Safety Gymnasium via OmniSafe show that KFCPO achieves 10.3%–50.2% higher average returns than the best baseline that respects the cost constraint, while maintaining safer behavior across diverse tasks, even under high observation dimensionality. These findings indicate that combining scalable second-order optimization with adaptive, safety-aware gradient blending yields robust, safe, and practically useful performance improvements in constrained Markov decision process (CMDP) settings.
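To make the margin-aware blending described above concrete, the following is a minimal sketch rather than the authors' exact rule. It assumes a sigmoid weight on the safety margin (cost limit minus current cost) and a PCGrad-style projection that is applied only when the reward-ascent and cost-descent directions conflict. The function name `blend_gradients`, the `temperature` parameter, and the sigmoid form are all illustrative assumptions that do not come from the paper.

```python
import numpy as np

def blend_gradients(g_reward, g_cost, cost_value, cost_limit, temperature=1.0):
    """Blend reward and cost gradients into one ascent direction.

    Illustrative only: the sigmoid weighting and the PCGrad-style
    projection are assumptions, not the rule used by KFCPO.
    """
    d_reward = np.asarray(g_reward, dtype=float)  # direction that raises reward
    d_safe = -np.asarray(g_cost, dtype=float)     # direction that lowers cost

    # Smooth margin-based weight: close to 1 well inside the cost limit,
    # close to 0 when the constraint is violated (no hard threshold).
    margin = cost_limit - cost_value
    w = 1.0 / (1.0 + np.exp(-margin / temperature))

    # Direction-aware projection: only when the two directions conflict,
    # remove from each the component that opposes the other.
    dot = d_reward @ d_safe
    if dot < 0.0:
        r_proj = d_reward - dot / (d_safe @ d_safe) * d_safe
        s_proj = d_safe - dot / (d_reward @ d_reward) * d_reward
    else:
        r_proj, s_proj = d_reward, d_safe

    return w * r_proj + (1.0 - w) * s_proj


# Conflicting gradients, agent exactly at the cost limit (weight 0.5).
step = blend_gradients([1.0, 0.0], [1.0, -1.0], cost_value=25.0, cost_limit=25.0)
```

Under these assumptions the blended step has a non-negative inner product with both the reward-ascent and the cost-descent direction, so to first order it does not trade one objective against the other, and the weight shifts smoothly as the margin changes.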
Abstract
We propose KFCPO, a novel Safe Reinforcement Learning (Safe RL) algorithm that combines scalable, Kronecker-Factored Approximate Curvature (K-FAC) based second-order policy optimization with safety-aware gradient manipulation. KFCPO leverages K-FAC to perform efficient and stable natural gradient updates by approximating the Fisher Information Matrix (FIM) in a layerwise, closed-form manner, avoiding the overhead of iterative approximation. To address the tradeoff between reward maximization and constraint satisfaction, we introduce a margin-aware gradient manipulation mechanism that adaptively adjusts the influence of the reward and cost gradients based on the agent's proximity to safety boundaries. The mechanism blends the gradients using a direction-sensitive projection, eliminating harmful interference and avoiding the abrupt changes caused by fixed hard thresholds. Additionally, a minibatch-level KL rollback strategy is adopted to ensure trust-region compliance and to prevent destabilizing policy shifts. Experiments on Safety Gymnasium using OmniSafe show that KFCPO achieves 10.3% to 50.2% higher average return across environments than the best baseline that respected the safety constraint, demonstrating a superior balance of safety and performance.
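The layerwise, closed-form Fisher approximation mentioned above can be illustrated with the standard K-FAC identity for a single dense layer. This is a minimal sketch, not the KFCPO implementation: the function name `kfac_natural_gradient`, the argument names, and the damping value are assumptions made for illustration. The layer's Fisher block is approximated by a Kronecker product of two small factors, so the preconditioned gradient follows from two small linear solves instead of an iterative solver.

```python
import numpy as np

def kfac_natural_gradient(acts, grad_out, grad_w, damping=1e-2):
    """K-FAC preconditioned gradient for one dense layer (illustrative).

    acts:     (N, d_in)  layer inputs
    grad_out: (N, d_out) gradients w.r.t. the layer's pre-activations
    grad_w:   (d_out, d_in) ordinary gradient w.r.t. the weight matrix
    """
    n = acts.shape[0]
    A = acts.T @ acts / n          # input second-moment factor, (d_in, d_in)
    G = grad_out.T @ grad_out / n  # output-gradient factor, (d_out, d_out)

    # Damp each factor separately so both stay invertible
    # (the damping scheme and value are assumptions).
    A_d = A + damping * np.eye(A.shape[0])
    G_d = G + damping * np.eye(G.shape[0])

    # With row-major flattening, kron(G_d, A_d) @ X.ravel() equals
    # (G_d @ X @ A_d).ravel(), so the inverse reduces to two small solves.
    left = np.linalg.solve(G_d, grad_w)     # G_d^{-1} grad_w
    return np.linalg.solve(A_d, left.T).T   # ... @ A_d^{-1} (A_d is symmetric)


# Small random layer: 3 inputs, 2 outputs, batch of 32.
rng = np.random.default_rng(0)
acts = rng.normal(size=(32, 3))
grad_out = rng.normal(size=(32, 2))
grad_w = rng.normal(size=(2, 3))
nat_grad = kfac_natural_gradient(acts, grad_out, grad_w)
```

For a layer with `d_in` inputs and `d_out` outputs, the factored form solves one `d_in × d_in` and one `d_out × d_out` system, whereas the unfactored Fisher block has `(d_in · d_out)²` entries. In a setup like the one the abstract describes, such a routine would be applied per layer, once to the reward gradient and once to the cost gradient.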
