e-COP: Episodic Constrained Optimization of Policies

Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Sahil Singla

TL;DR

The paper advances constrained RL in episodic, finite-horizon MDPs by introducing e-COP, a PPO-like policy optimization method anchored to a novel episodic policy-difference lemma. It replaces Hessian-based updates with a quadratic damping penalty and a ReLU-like constraint penalty, yielding a tractable, clipped surrogate that maintains equivalence to the CMDP optimum under appropriate scaling. The approach provides both theoretical guarantees and strong empirical performance on Safety Gym and MuJoCo benchmarks, outperforming or matching state-of-the-art baselines adapted to episodic settings. The proposed method is easy to integrate into existing RL pipelines and holds promise for safety-critical applications including RL from human feedback for large models.
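The claim that a ReLU-like penalty can match the constrained optimum under suitable scaling can be illustrated on a toy problem. The sketch below is not the e-COP loss: the one-dimensional objective, the constraint `x <= 1`, the helper `penalized_argmax`, and the reading of "appropriate scaling" as "penalty weight large enough" are all assumptions made for illustration.

```python
import numpy as np

def penalized_argmax(lam, grid):
    """Grid-search maximizer of f(x) - lam * relu(x - 1) for a toy problem."""
    f = -(grid - 2.0) ** 2                    # toy objective, unconstrained optimum at x = 2
    violation = np.maximum(0.0, grid - 1.0)   # ReLU of the toy constraint x <= 1
    return grid[np.argmax(f - lam * violation)]

grid = np.linspace(-1.0, 3.0, 4001)           # step 0.001 over [-1, 3]

# Weight below the objective's slope at the boundary (f'(1) = 2):
# the penalized optimum still violates the constraint.
x_small = penalized_argmax(0.5, grid)

# Weight above that slope: the penalized optimum coincides with the
# constrained optimum x = 1.
x_large = penalized_argmax(3.0, grid)
```

In this toy setting the penalty is exact once its weight exceeds the slope of the objective at the constraint boundary, which is why no second-order (Hessian-based) machinery is needed to land on the constrained solution.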

Abstract

In this paper, we present the $\texttt{e-COP}$ algorithm, the first policy optimization algorithm for constrained Reinforcement Learning (RL) in episodic (finite horizon) settings. Such formulations are applicable when there are separate sets of optimization criteria and constraints on a system's behavior. We approach this problem by first establishing a policy difference lemma for the episodic setting, which provides the theoretical foundation for the algorithm. Then, we propose to combine a set of established and novel solution ideas to yield the $\texttt{e-COP}$ algorithm that is easy to implement and numerically stable, and provide a theoretical guarantee on optimality under certain scaling assumptions. Through extensive empirical analysis using benchmarks in the Safety Gym suite, we show that our algorithm has similar or better performance than SoTA (non-episodic) algorithms adapted for the episodic setting. The scalability of the algorithm opens the door to its application in safety-constrained Reinforcement Learning from Human Feedback for Large Language or Diffusion Models.
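The episodic policy difference lemma mentioned above has a standard finite-horizon form: the return gap between two policies equals the sum, over steps, of the old policy's advantages averaged under the new policy's state-action distribution. The sketch below checks that identity numerically on a small random tabular MDP. The MDP, its sizes, and the helper names (`evaluate`, `state_distributions`) are hypothetical and chosen for illustration; the paper's exact statement may use different notation.

```python
import numpy as np

def evaluate(P, R, pi):
    """Backward induction for a time-dependent policy.
    P: (S, A, S) transitions, R: (S, A) rewards, pi: (H, S, A) action probabilities.
    Returns V with shape (H + 1, S) and Q with shape (H, S, A)."""
    H, S, A = pi.shape
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        Q[h] = R + P @ V[h + 1]              # expected reward-to-go per (s, a)
        V[h] = (pi[h] * Q[h]).sum(axis=1)    # average over the policy's actions
    return V, Q

def state_distributions(P, pi, mu):
    """State distribution at every step under pi, starting from mu. Shape (H, S)."""
    H, S, A = pi.shape
    d = np.zeros((H, S))
    d[0] = mu
    for h in range(H - 1):
        d[h + 1] = np.einsum("s,sa,sat->t", d[h], pi[h], P)
    return d

rng = np.random.default_rng(0)
S, A, H = 4, 3, 5
P = rng.dirichlet(np.ones(S), size=(S, A))       # (S, A, S), rows sum to 1
R = rng.uniform(size=(S, A))
mu = rng.dirichlet(np.ones(S))                   # shared initial state distribution
pi_old = rng.dirichlet(np.ones(A), size=(H, S))  # (H, S, A)
pi_new = rng.dirichlet(np.ones(A), size=(H, S))

V_old, Q_old = evaluate(P, R, pi_old)
V_new, _ = evaluate(P, R, pi_new)
adv_old = Q_old - V_old[:H, :, None]             # step-wise advantage of pi_old

# Left side: difference in expected episodic return.
lhs = mu @ V_new[0] - mu @ V_old[0]
# Right side: old-policy advantages averaged under the new policy's visitation.
d_new = state_distributions(P, pi_new, mu)
rhs = np.einsum("hs,hsa,hsa->", d_new, pi_new, adv_old)
```

Both sides agree up to floating-point error, so the advantage of the current policy, sampled along trajectories of a candidate policy, is enough to score that candidate.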

Paper Structure

This paper contains 18 sections, 8 theorems, 41 equations, 6 figures, 3 tables, 2 algorithms.

Key Result

Lemma 3.0

For an episode of length $H$ and two policies, $\bm{\pi}$ and $\bm{\pi}'$, the difference in their performance assuming identical initial state distribution $\mu$ (i.e., $s_{1} \sim \mu$) is given by
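A standard form of this finite-horizon identity is given here in assumed notation, which may differ from the paper's own symbols: $J$ denotes expected episodic return, $A_h^{\bm{\pi}}$ the step-$h$ advantage under $\bm{\pi}$, and the expectation is over trajectories generated by $\bm{\pi}'$ from $s_1 \sim \mu$.

```latex
J(\bm{\pi}') - J(\bm{\pi})
  = \sum_{h=1}^{H} \mathbb{E}_{s_1 \sim \mu,\; (s_h, a_h) \sim \bm{\pi}'}
    \left[ A_h^{\bm{\pi}}(s_h, a_h) \right],
\qquad
A_h^{\bm{\pi}}(s, a) = Q_h^{\bm{\pi}}(s, a) - V_h^{\bm{\pi}}(s).
```

The identity follows by telescoping the value functions of $\bm{\pi}$ along a trajectory of $\bm{\pi}'$; because the horizon is finite, the advantage is indexed by the step $h$ and no discount factor is needed.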

Figures (6)

  • Figure 1: The Humanoid, Circle, Reach, Grid, Bottleneck, and Navigation tasks. See the environments appendix for details.
  • Figure 2: Learning curves of cumulative episodic reward and constraint cost versus episode for some algorithm-task pairs. Solid lines are means and shaded areas show 1 standard deviation, both computed over 5 runs. The dashed line in the constraint plots is the constraint threshold.
  • Figure 3: Navigation environment with multiple constraints: Episodic Rewards (left), Cost1 (center, for hazards), and Cost2 (right, for pillars) of e-COP. The dashed line in the cost plots is the cost threshold (10 for Cost1 and 25 for Cost2). "C1 constrained" and "C2 constrained" mean that only Cost1 or only Cost2, respectively, enters the e-COP loss function and the other cost is ignored.
  • Figure 4: Cumulative episodic rewards and costs of baselines in two environments with two constraint thresholds.
  • Figure 5: The Humanoid, Circle, Reach, Grid, Bottleneck, and Navigation tasks. (a) Humanoid: The agent must run as fast as possible on a flat surface without exceeding a specified speed limit, which is the cost constraint. (b) Circle: The agent is rewarded for moving along a specified circle but is penalized if the diameter of the circle exceeds a threshold value (Achiam et al., 2017). (c) Reach: The agent is rewarded for reaching a goal while avoiding obstacles (cost constraints) placed to hinder it (Ray et al., 2019). (d) Grid: The agent controls traffic lights in a 3x3 road network and is rewarded for high traffic throughput but is constrained to keep lights red for at most 5 consecutive seconds (Vinitsky et al., 2018). (e) Bottleneck: The agent controls vehicles (red) in a merging traffic situation and is rewarded for maximizing the number of vehicles that pass through, but is constrained to ensure that white vehicles (not controlled by the agent) have "low" speed for no more than 10 seconds (Vinitsky et al., 2018). (f) Navigation: The agent is rewarded for reaching the target area (green) but is constrained to avoid hazards (light purple) and impassable pillars (dark purple). Hazards and pillars incur different costs (Ray et al., 2019).
  • ...and 1 more figure

Theorems & Definitions (13)

  • Lemma 3.0
  • Proposition 3.0
  • Theorem 3.1
  • Lemma A.0
  • proof
  • Proposition A.0
  • proof
  • Lemma A.0
  • proof
  • Lemma A.0
  • ...and 3 more