ACPO: A Policy Optimization Algorithm for Average MDPs with Constraints

Akhil Agnihotri, Rahul Jain, Haipeng Luo

TL;DR

This work addresses policy optimization for average-constrained MDPs (ACMDPs), where long-run safety constraints are encoded as $J_{C_i}(\pi) \le l_i$ and the objective is the long-run average reward $J(\pi)$; in this setting the standard discounted criterion can give a misleading picture of constraint satisfaction. The authors derive new policy-improvement bounds specific to the average criterion and develop ACPO, a trust-region policy optimization algorithm that enforces the cost constraints within a KL-divergence trust region and works with the average-reward value (bias) and advantage functions $\widebar{V}^{\pi}$ and $\widebar{A}^{\pi}$. They implement a practical, sampling-based version using Lagrangian duality and a recovery mechanism, and report better performance than state-of-the-art baselines on OpenAI Gym/MuJoCo tasks. The approach provides a scalable, theory-grounded framework for safe, long-horizon RL applicable to robotics, RLHF for LLMs, and other safety-critical domains.
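
As a rough illustration of why the two criteria can disagree, the sketch below compares an empirical long-run average cost with a normalized discounted cost on one made-up cost stream. Everything in it is an assumption for illustration and none of it comes from the paper: the function names, the cost stream, the threshold `limit`, and the discount factor are all hypothetical.

```python
def average_cost(costs):
    """Empirical long-run average cost over a finite trajectory."""
    return sum(costs) / len(costs)


def normalized_discounted_cost(costs, gamma):
    """(1 - gamma) * sum_t gamma^t * c_t, scaled so it is comparable to an average."""
    total = 0.0
    weight = 1.0  # holds gamma^t for the current step t
    for c in costs:
        total += weight * c
        weight *= gamma
    return (1.0 - gamma) * total


# Hypothetical cost stream: no cost for the first 50 steps, cost 1 at every later step.
costs = [0.0] * 50 + [1.0] * 9950
limit = 0.5  # hypothetical constraint threshold l

avg = average_cost(costs)
disc = normalized_discounted_cost(costs, gamma=0.9)

print("average cost:", avg, "satisfies limit:", avg <= limit)
print("discounted cost:", disc, "satisfies limit:", disc <= limit)
```

On this stream the average cost is 0.995, well above the threshold, while the normalized discounted cost with $\gamma = 0.9$ is roughly 0.005 because almost all of the discounted weight falls on the early cost-free steps. A discounted check would therefore report the constraint as satisfied even though the long-run behavior violates it.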

Abstract

Reinforcement Learning (RL) for constrained MDPs (CMDPs) is an increasingly important problem for various applications. Often, the average criterion is more suitable than the discounted criterion. Yet, RL for average-CMDPs (ACMDPs) remains a challenging problem. Algorithms designed for discounted constrained RL problems often do not perform well for the average CMDP setting. In this paper, we introduce a new policy optimization with function approximation algorithm for constrained MDPs with the average criterion. The Average-Constrained Policy Optimization (ACPO) algorithm is inspired by trust region-based policy optimization algorithms. We develop basic sensitivity theory for average CMDPs, and then use the corresponding bounds in the design of the algorithm. We provide theoretical guarantees on its performance, and through extensive experimental work in various challenging OpenAI Gym environments, show its superior empirical performance when compared to other state-of-the-art algorithms adapted for the ACMDPs.

Paper Structure (30 sections, 13 theorems, 52 equations, 5 figures, 2 tables, 1 algorithm)

This paper contains 30 sections, 13 theorems, 52 equations, 5 figures, 2 tables, 1 algorithm.

Key Result

Lemma 3.0

(zhang2020average) Under the unichain assumption of the underlying Markov chain, for any stochastic policies $\pi$ and $\pi'$:
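
The displayed equation is missing from this excerpt. Assuming the lemma is the standard average-reward performance-difference identity, it would read as below. Here $d_{\pi'}$ is taken to be the stationary state distribution under $\pi'$ (a symbol not defined in this excerpt) and $\widebar{A}^{\pi}$ is the average-reward advantage function mentioned in the TL;DR.

```latex
J(\pi') - J(\pi) = \mathbb{E}_{s \sim d_{\pi'},\, a \sim \pi'}\left[ \widebar{A}^{\pi}(s, a) \right]
```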

Figures (5)

  • Figure 1: Learning curves of the average reward and constraint cost versus iterations (in $10^{4}$) for some algorithm-task pairs. Solid lines are empirical means and shaded areas represent 1 standard deviation, both computed over 5 runs. The dashed line in the constraint plots is the constraint threshold $l$. ATRPO and PPO are tested with constraints, which are included in their Lagrangian formulation. Additional results are available in the appendix on additional results.
  • Figure 2: Comparison of the performance of ACPO with different values of the hyperparameter $t$ in the Point-Circle environment. The x-axis is iterations in $10^{4}$. See the appendix on additional results for more details.
  • Figure 3: The Circle, Gather, Grid, and Bottleneck tasks. (a) Circle: The agent is rewarded for moving in a specified circle but is penalized if the diameter of the circle is larger than some value, as in (achiam2017constrained). (b) Gather: The agent is rewarded for collecting the green balls and penalized for gathering red balls, as in (achiam2017constrained). (c) Grid: The agent controls traffic lights in a 3x3 road network and is rewarded for high traffic throughput but is constrained to let lights be red for at most 5 consecutive seconds, as in (vinitsky2018benchmarks). (d) Bottleneck: The agent controls vehicles (red) in a merging traffic situation and is rewarded for maximizing the number of vehicles that pass through but is constrained to ensure that white vehicles (not controlled by the agent) have "low" speed for no more than 10 seconds, as in (vinitsky2018benchmarks).
  • Figure 4: Learning curves of the average reward and constraint cost versus iterations (in $10^{4}$) for some algorithm-task pairs. Solid lines are empirical means and shaded areas represent 1 standard deviation, both computed over 5 runs. The dashed line in the constraint plots is the constraint threshold $l$. ATRPO and PPO are tested with constraints, which are included in their Lagrangian formulation.
  • Figure 5: Comparison of the performance of ACPO with different values of the hyperparameter $t$ in various environments. The x-axis is iterations in $10^{4}$.

Theorems & Definitions (21)

  • Lemma 3.0
  • Lemma 3.0
  • Lemma 3.0
  • Proposition 3.1
  • Corollary 3.2
  • Theorem 3.3
  • Lemma 1.1: Trivialization of Discounted Criterion Bounds
  • proof
  • Lemma 1.1
  • proof
  • ...and 11 more