Directional-Clamp PPO

Gilad Karpel, Ruida Zhou, Shoham Sabach, Mohammad Ghavamzadeh

TL;DR

This work identifies a failure mode in PPO where importance-sampling ratios drift into the wrong direction relative to the action advantage, undermining learning despite clipping. To counter this, it introduces Directional-Clamp PPO (DClamp-PPO), which adds a directional penalty active only in the strict wrong-direction region defined by $w_{\theta}(s,a)>1+\beta$ for negative advantages or $w_{\theta}(s,a)<1-\beta$ for positive advantages, governed by a slope parameter $\alpha>1$. The authors provide theoretical justification showing that, starting in the wrong-direction region, the updates push the ratio toward 1 more effectively than standard PPO, and they validate this with extensive MuJoCo experiments showing improved stability and performance over PPO and strong variants like Leaky-PPO and PPO-RB. Overall, the results demonstrate that constraining wrong-direction updates can enhance trust-region behavior and learning efficiency in continuous-control tasks, suggesting broader applicability of directional penalties in policy optimization.

Abstract

Proximal Policy Optimization (PPO) is widely regarded as one of the most successful deep reinforcement learning algorithms, known for its robustness and effectiveness across a range of problems. The PPO objective encourages the importance ratio between the current and behavior policies to move in the "right" direction: starting from importance sampling ratios equal to 1, it increases the ratios for actions with positive advantages and decreases those for actions with negative advantages. A clipping function is introduced to prevent over-optimization when updating the importance ratio in these "right" direction regions. Many PPO variants have been proposed to extend its success, most of which modify the objective's behavior by altering the clipping in the "right" direction regions. However, due to randomness in the rollouts and stochasticity of the policy optimization, we observe that the ratios frequently move in the "wrong" direction during PPO optimization. This is a key factor hindering the improvement of PPO, but it has been largely overlooked. To address this, we propose the Directional-Clamp PPO algorithm (DClamp-PPO), which further penalizes actions that enter the strict "wrong" direction regions, where the advantage is positive (negative) and the importance ratio falls below (above) $1 - \beta$ ($1 + \beta$), for a tunable parameter $\beta \in (0, 1)$. The penalty enforces a steeper loss slope, i.e., a clamp, in those regions. We demonstrate that DClamp-PPO consistently outperforms PPO, as well as variants that focus on modifying the objective's behavior in the "right" direction, across various MuJoCo environments and random seeds. The proposed method is shown, both theoretically and empirically, to better avoid "wrong" direction updates while keeping the importance ratio closer to 1.
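
To make the "steeper slope in the strict wrong-direction region" idea concrete, here is a minimal per-sample sketch in plain Python. The exact form of the paper's objective is not reproduced on this page, so the continuous piecewise-linear penalty below is an assumption chosen only to match the description (slope multiplied by $\alpha > 1$ once the ratio crosses $1 - \beta$ or $1 + \beta$ on the wrong side). The function names, the clip range `eps`, and the default values are hypothetical.

```python
def ppo_clip_surrogate(w, adv, eps=0.2):
    """Standard PPO clipped surrogate for one sample with ratio w and advantage adv."""
    clipped = min(max(w, 1.0 - eps), 1.0 + eps)
    return min(w * adv, clipped * adv)


def dclamp_surrogate(w, adv, eps=0.2, beta=0.1, alpha=2.0):
    """Assumed DClamp-style surrogate: PPO plus an extra linear penalty that
    raises the slope from adv to alpha * adv in the strict wrong-direction
    region. This is an illustrative form, not necessarily the paper's formula."""
    base = ppo_clip_surrogate(w, adv, eps)
    if adv > 0 and w < 1.0 - beta:
        # Positive advantage but the ratio dropped below 1 - beta.
        return base + (alpha - 1.0) * adv * (w - (1.0 - beta))
    if adv < 0 and w > 1.0 + beta:
        # Negative advantage but the ratio rose above 1 + beta.
        return base + (alpha - 1.0) * adv * (w - (1.0 + beta))
    return base


if __name__ == "__main__":
    for w in (0.7, 0.8, 0.9, 1.0, 1.5):
        print(w, ppo_clip_surrogate(w, 1.0), dclamp_surrogate(w, 1.0))
```

Under this assumed form the two surrogates coincide everywhere outside the strict wrong-direction region, and the penalty is continuous at the thresholds $1 \pm \beta$, so only the slope changes there.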

Paper Structure

This paper contains 23 sections, 1 theorem, 18 equations, 9 figures, 5 tables, 1 algorithm.

Key Result

Lemma 1

For any $t\in \Omega_{T}$, if $w_{\theta_0}(s_t,a_t)$ satisfies $\sum_{t'\in\Omega_T} \langle\nabla w_{\theta_0}(s_t,a_t),\nabla w_{\theta_0}(s_{t'},a_{t'})\rangle \hat{A}(s_t,a_t)\hat{A}(s_{t'},a_{t'})>0$, then there exists $\bar{\gamma}>0$, such that for any $\gamma \in (0,\bar{\gamma})$, we have

Figures (9)

  • Figure 1: The surrogate objective of PPO, Leaky PPO and PPO-RB as a function of the likelihood ratio $w_{\theta}(s,a)$ for positive advantage (left) and negative advantage (right). The black dot denotes $w_{\theta}(s,a) = 1$.
  • Figure 2: The right and wrong directions on the PPO surrogate function. The black dot at importance ratio 1 marks the starting point of the optimization.
  • Figure 3: Histograms of importance ratios during PPO optimization for two MuJoCo environments.
  • Figure 4: The surrogate objectives of PPO and DClamp-PPO, where the red curve is $\mathcal{J}_{\text{DClamp-PPO}}$ and the black curve is $\mathcal{J}_{\text{PPO}}$.
  • Figure 5: Histograms of policy ratio values $r_{\theta}(s,a)$ measured during training for DClamp-PPO and PPO on representative MuJoCo tasks. Each histogram aggregates tens of millions of samples collected across optimization steps.
  • ...and 4 more figures

Theorems & Definitions (2)

  • Lemma 1: DClamp-PPO update moves the ratio toward $1$ in strict wrong direction
  • Proof of Lemma 1
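
The claim summarized in the title of Lemma 1 (a DClamp-PPO update moves the ratio toward 1 in the strict wrong direction) can be illustrated on a single-sample toy problem. The sketch below is not the paper's setting: it assumes a hypothetical one-parameter log-ratio model $w = e^{\theta}$, one gradient-ascent step with a small step size $\gamma$, and a wrong-direction slope of $\hat{A}$ for PPO versus $\alpha \hat{A}$ for DClamp-PPO. All names and default values are illustrative.

```python
import math


def wrong_side_slope(w, adv, beta=0.1, alpha=2.0, dclamp=True):
    """Slope d(surrogate)/dw, valid only when w is on the wrong side of 1
    (w < 1 for adv > 0, w > 1 for adv < 0), where PPO clipping is inactive.
    Assumed DClamp behavior: slope alpha * adv in the strict wrong region."""
    strict_wrong = (adv > 0 and w < 1.0 - beta) or (adv < 0 and w > 1.0 + beta)
    return alpha * adv if (dclamp and strict_wrong) else adv


def one_step_ratio(w0, adv, gamma, dclamp):
    """One gradient-ascent step on theta = log w; returns the updated ratio."""
    theta0 = math.log(w0)
    grad = wrong_side_slope(w0, adv, dclamp=dclamp) * w0  # chain rule: dw/dtheta = w
    return math.exp(theta0 + gamma * grad)


if __name__ == "__main__":
    w0, adv, gamma = 0.7, 1.0, 0.1  # positive advantage, ratio below 1 - beta
    print("PPO   :", one_step_ratio(w0, adv, gamma, dclamp=False))
    print("DClamp:", one_step_ratio(w0, adv, gamma, dclamp=True))
```

For a small enough step size, both updates move the ratio back toward 1, and the steeper DClamp slope moves it further in a single step. The lemma's general condition involves gradient inner products across samples, which this single-sample toy satisfies trivially.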