A dynamical clipping approach with task feedback for Proximal Policy Optimization

Ziqi Zhang; Jingzehua Xu; Zifeng Zhuang; Hongyin Zhang; Jinxin Liu; Donglin Wang; Shuai Zhang

TL;DR

Pb-PPO uses a multi-armed bandit to reflect the RL preference (maximizing Return): it recommends the PPO clipping bound that maximizes the current Return, which yields greater stability and improved performance compared to PPO with a fixed clipping bound.

Abstract

Proximal Policy Optimization (PPO) has been broadly applied to robotics learning, showcasing stable training performance. However, the fixed clipping bound setting may limit the performance of PPO. Specifically, there is no theoretical proof that the optimal clipping bound remains consistent throughout the entire training process. Meanwhile, previous research suggests that a fixed clipping bound restricts the policy's ability to explore. Therefore, many past studies have aimed to dynamically adjust the PPO clipping bound to enhance PPO's performance. However, the objectives of these approaches are not directly aligned with the objective of reinforcement learning (RL) tasks, which is to maximize the cumulative Return. Unlike previous clipping approaches, we propose a bi-level proximal policy optimization objective that can dynamically adjust the clipping bound to better reflect the preference (maximizing Return) of these RL tasks. Based on this bi-level proximal policy optimization paradigm, we introduce a new algorithm named Preference based Proximal Policy Optimization (Pb-PPO). Pb-PPO utilizes a multi-armed bandit approach to reflect RL preference, recommending the clipping bound for PPO that maximizes the current Return. As a result, Pb-PPO achieves greater stability and improved performance compared to PPO with a fixed clipping bound. We test Pb-PPO on locomotion benchmarks across multiple environments, including Gym-Mujoco and legged-gym. Additionally, we validate Pb-PPO on customized navigation tasks. We also compare against PPO with various fixed clipping bounds and against various clipping approaches. The experimental results indicate that Pb-PPO demonstrates superior training performance compared to PPO and its variants. Our codebase has been released at: https://github.com/stevezhangzA/pb_ppo
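
To make the idea in the abstract concrete, the following is a minimal sketch of a bandit that chooses among candidate clipping bounds using task Return as feedback. It is an illustration under stated assumptions, not the released Pb-PPO implementation: it uses the textbook UCB1 rule rather than the paper's own uncertainty term, and the names `ClipBoundBandit`, `clipped_surrogate`, and `run_demo`, the candidate bounds, and the simulated Returns are all hypothetical.

```python
import math
import random


class ClipBoundBandit:
    """UCB1 bandit over candidate PPO clipping bounds.

    Assumption: standard UCB1 stands in for the paper's selection rule.
    """

    def __init__(self, candidates, exploration=1.0):
        self.candidates = list(candidates)
        self.exploration = exploration
        self.counts = [0] * len(self.candidates)
        self.mean_return = [0.0] * len(self.candidates)

    def select(self):
        # Pull every arm once before relying on confidence bounds.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        total = sum(self.counts)
        scores = [
            m + self.exploration * math.sqrt(2.0 * math.log(total) / n)
            for m, n in zip(self.mean_return, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, observed_return):
        # Incremental mean of the Return observed under this bound.
        self.counts[arm] += 1
        n = self.counts[arm]
        self.mean_return[arm] += (observed_return - self.mean_return[arm]) / n


def clipped_surrogate(ratio, advantage, eps):
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def run_demo(steps=200, seed=0):
    rng = random.Random(seed)
    bandit = ClipBoundBandit([0.1, 0.2, 0.3])
    # Hypothetical stand-in for "update PPO with this bound, then measure Return".
    simulated_mean = {0.1: 1.0, 0.2: 2.0, 0.3: 1.5}
    for _ in range(steps):
        arm = bandit.select()
        eps = bandit.candidates[arm]
        bandit.update(arm, simulated_mean[eps] + rng.gauss(0.0, 0.1))
    return bandit
```

In a real training loop, the simulated Return would be replaced by the Return measured after a PPO update that uses the selected bound in `clipped_surrogate`; the bandit then concentrates its pulls on whichever bound has produced the highest Return so far.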

Paper Structure (36 sections, 17 equations, 6 figures, 2 tables)

Related Work
Proximal Policy Optimization (PPO).
Preference Based RL (PbRL).
Preliminary
Reinforcement Learning.
Proximal Policy Optimization (PPO).
Multi-armed bandit and Upper Confidence Bound (UCB).
Preference based Proximal Policy Optimization (Pb-PPO)
Bi-level Proximal Policy Optimization.
Preference based Proximal Policy Optimization (Pb-PPO).
Implementation of Objective (2)
Notations.
Sampling clipping bound with alternate uncertainty term.
Connection between Equation \ref{ucb_computing} and Equation \ref{ucb}.
Estimation of candidate clipping bounds' expected Return.
...and 21 more sections

Figures (6)

Figure 1: Pb-PPO (task feedback) on locomotion tasks. Each solid curve in these figures represents the average experimental result across multiple seeds, and the shaded area corresponds to the fluctuation of the Return curves.
Figure 2: Pb-PPO on AUV navigation tasks across different difficulty levels. (a) Average Return curve in the hard navigation task. (b) From left to right: trajectories of the AUV in the easy, medium, and hard environments. The environment settings are described in the Appendix.
Figure 3: (a) Performance of Pb-PPO on flat terrain. The first image visualizes the training curve of Pb-PPO, showing the average returns in a parallel environment of multiple robots. The remaining images (depicting linear velocities in the x and y directions, and angular velocity) visualize the physical values and corresponding commands during the evaluation process, in which the pre-trained policy is used to initialize 50 robots. (b) Performance of Pb-PPO on complex terrain. We took the policy trained in a complex environment and tested it in the flat-discrete terrain environment shown in the first image. The remaining images are analogous to those in (a).
Figure 4: The walking states of quadruped robots (anymal-c) on complex terrains. The upper figure shows the quadruped robot trained by Pb-PPO, and the lower figure shows the quadruped robot trained by PPO. In general, Pb-PPO exhibits a more stable gait when climbing stairs.
Figure 5: (a) Return of Pb-PPO across different ${\rm num}(\zeta)$. (b) The success rate of policy improvement.
...and 1 more figure
