Directional-Clamp PPO

Gilad Karpel; Ruida Zhou; Shoham Sabach; Mohammad Ghavamzadeh

TL;DR

This work identifies a failure mode in PPO in which importance-sampling ratios drift in the wrong direction relative to the sign of the action's advantage, undermining learning despite clipping. To counter this, it introduces Directional-Clamp PPO (DClamp-PPO), which adds a directional penalty that is active only in the strict wrong-direction region, defined by $w_{\theta}(s,a)>1+\beta$ for negative advantages or $w_{\theta}(s,a)<1-\beta$ for positive advantages. The strength of the penalty is governed by a slope parameter $\alpha>1$. The authors provide a theoretical justification showing that, when the ratio starts in the wrong-direction region, DClamp-PPO updates push it toward 1 more effectively than standard PPO updates. They support this with MuJoCo experiments that report improved stability and performance over PPO and over strong variants such as Leaky-PPO and PPO-RB. Overall, the results indicate that penalizing wrong-direction updates can improve trust-region behavior and learning efficiency in continuous-control tasks, and they suggest that directional penalties may apply more broadly in policy optimization.
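The region definition above can be written as a small predicate. This is a minimal sketch rather than the authors' code: the function name `in_strict_wrong_region` and the choice to treat a zero advantage as having no preferred direction are assumptions made for illustration.

```python
def in_strict_wrong_region(ratio: float, advantage: float, beta: float) -> bool:
    """Return True when the importance ratio has moved strictly against
    the sign of the advantage, by more than the margin beta.

    ratio     -- importance ratio w_theta(s, a)
    advantage -- advantage estimate for the action
    beta      -- margin in (0, 1), as stated in the abstract
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie in (0, 1)")
    if advantage > 0.0:
        # A positive advantage wants the ratio to grow; it fell below 1 - beta.
        return ratio < 1.0 - beta
    if advantage < 0.0:
        # A negative advantage wants the ratio to shrink; it rose above 1 + beta.
        return ratio > 1.0 + beta
    # Assumption: a zero advantage has no preferred direction.
    return False
```

For example, with $\beta = 0.1$, a ratio of 0.8 on a positive-advantage action falls in the strict wrong-direction region, while a ratio of 0.95 does not.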

Abstract

Proximal Policy Optimization (PPO) is widely regarded as one of the most successful deep reinforcement learning algorithms, known for its robustness and effectiveness across a range of problems. The PPO objective encourages the importance ratio between the current and behavior policies to move in the "right" direction: starting from importance sampling ratios equal to 1, it increases the ratios for actions with positive advantages and decreases them for actions with negative advantages. A clipping function is introduced to prevent over-optimization when updating the importance ratio in these "right" direction regions. Many PPO variants have been proposed to extend its success, most of which modify the objective's behavior by altering the clipping in the "right" direction regions. However, due to randomness in the rollouts and stochasticity of the policy optimization, we observe that the ratios frequently move in the "wrong" direction during PPO optimization. This is a key factor hindering the improvement of PPO, but it has been largely overlooked. To address this, we propose the Directional-Clamp PPO algorithm (DClamp-PPO), which further penalizes actions whose ratios enter the strict "wrong" direction regions, where the advantage is positive (negative) and the importance ratio falls below (rises above) $1 - \beta$ ($1 + \beta$), for a tunable parameter $\beta \in (0, 1)$. The penalty is imposed by enforcing a steeper loss slope, i.e., a clamp, in those regions. We demonstrate that DClamp-PPO consistently outperforms PPO, as well as PPO variants that focus on modifying the objective's behavior in the "right" direction, across various MuJoCo environments and different random seeds. The proposed method is shown, both theoretically and empirically, to better avoid "wrong" direction updates while keeping the importance ratio closer to 1.
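To make the idea of a steeper slope concrete, the sketch below compares the standard PPO clipped surrogate with one possible per-sample clamp. It is an illustration under stated assumptions, not the paper's exact objective: the function names, the default values of `eps`, `beta`, and `alpha`, and the choice to make the steeper branch meet the unclipped term continuously at $1 - \beta$ (or $1 + \beta$) are all hypothetical. Both functions return a quantity to be maximized.

```python
def ppo_clip_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate for one sample (to be maximized)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def dclamp_surrogate(ratio: float, advantage: float, eps: float = 0.2,
                     beta: float = 0.1, alpha: float = 2.0) -> float:
    """Hypothetical directional clamp: PPO's surrogate plus a steeper branch.

    Assumption: the steeper branch is a line with slope alpha * advantage
    that touches ratio * advantage at the anchor 1 - beta (positive
    advantage) or 1 + beta (negative advantage). Taking the minimum means
    it only takes effect in the strict wrong-direction region.
    """
    anchor = 1.0 - beta if advantage >= 0.0 else 1.0 + beta
    steep = (anchor + alpha * (ratio - anchor)) * advantage
    return min(ppo_clip_surrogate(ratio, advantage, eps), steep)


if __name__ == "__main__":
    # Positive advantage: ratios below 1 - beta are penalized more steeply.
    for w in (0.7, 0.8, 1.0, 1.5):
        print(w, ppo_clip_surrogate(w, 1.0), dclamp_surrogate(w, 1.0))
```

Under these assumptions the two surrogates agree whenever the ratio is outside the strict wrong-direction region, so the right-direction clipping of PPO is left unchanged. Inside that region the slope with respect to the ratio is `alpha * advantage` instead of `advantage`, which gives a larger gradient pulling the ratio back toward 1.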
