CIM-PPO: Proximal Policy Optimization with Liu-Correntropy Induced Metric

Yunxiao Guo, Han Long, Xiaojun Duan, Kaiyuan Feng, Maochu Li, Xiaying Ma

TL;DR

The paper states that the PPO-CIM algorithm has a lower computational cost in the policy gradient, and proves that PPO-CIM keeps the new policy within the trust region when the kernel satisfies certain conditions.

Abstract

As a popular Deep Reinforcement Learning (DRL) algorithm, Proximal Policy Optimization (PPO) has demonstrated remarkable efficacy in numerous complex tasks. According to the penalty mechanism in the surrogate objective, PPO can be classified into PPO with KL divergence (PPO-KL) and PPO with clipping (PPO-Clip). In this paper, we analyze the impact of the asymmetry of the KL divergence on PPO-KL and highlight that, when this asymmetry is pronounced, it misguides the improvement of the surrogate. To address this issue, we represent PPO-KL in inner-product form and demonstrate that the KL divergence is a Correntropy Induced Metric (CIM) in Euclidean space. Subsequently, we extend PPO-KL to the Reproducing Kernel Hilbert Space (RKHS), redefine the inner products in the RKHS, and propose the PPO-CIM algorithm. Moreover, we state that the PPO-CIM algorithm has a lower computational cost in the policy gradient and prove that PPO-CIM keeps the new policy within the trust region when the kernel satisfies certain conditions. Finally, we design experiments on six MuJoCo continuous-action tasks to validate the proposed algorithm. The experimental results confirm that the asymmetry of the KL divergence can affect the policy improvement of PPO-KL and show that PPO-CIM performs better than both PPO-KL and PPO-Clip in most tasks.
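To make the central notion concrete, the following is a minimal sketch of the sample-based correntropy induced metric with a Gaussian kernel, in its standard form $\mathrm{CIM}(X,Y)=\sqrt{\kappa(0)-\frac{1}{N}\sum_i \kappa(x_i-y_i)}$. It is not taken from the paper: the function names `gaussian_kernel` and `cim`, the kernel bandwidth `sigma`, and the choice of plain sample vectors as inputs are all assumptions made for illustration, and the abstract does not say how PPO-CIM applies the metric to policies. The sketch only shows the property the abstract relies on, namely that the metric is symmetric in its two arguments.

```python
import math


def gaussian_kernel(e, sigma=1.0):
    """Gaussian kernel evaluated at the difference e (bandwidth sigma is an assumption)."""
    return math.exp(-e * e / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def cim(x, y, sigma=1.0):
    """Sample-based correntropy induced metric between two equal-length sequences.

    Illustrative only: the inputs are plain sample vectors, not policies.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    # Empirical correntropy: mean kernel value of the pairwise differences.
    correntropy = sum(gaussian_kernel(a - b, sigma) for a, b in zip(x, y)) / len(x)
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(gaussian_kernel(0.0, sigma) - correntropy, 0.0))


if __name__ == "__main__":
    x = [1.0, 1.2, 0.8]
    y = [1.1, 0.9, 1.5]
    # Swapping the arguments leaves the value unchanged.
    print(cim(x, y), cim(y, x))
```

Because the kernel depends only on the squared difference of its arguments, exchanging `x` and `y` cannot change the result, which is the contrast with the KL divergence that motivates the paper.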

Paper Structure

This paper contains 15 sections, 4 theorems, 52 equations, 5 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

For any multi-dimensional normal distributions $\boldsymbol{\pi}_{\boldsymbol{\theta}_i}(\boldsymbol{a}|\boldsymbol{s})\sim\mathcal{N}(a|\mathbf{F}_{\boldsymbol{\theta}_i}^{\boldsymbol{\mu}},\mathbf{F}_{\theta_i}^{\boldsymbol{\Sigma}}),\boldsymbol{\pi}_{\boldsymbol{\theta}_j}(\boldsymbol{a}|\boldsym

Figures (5)

  • Figure 1: Screenshots of the environments for the MuJoCo continuous tasks: (a) Swimmer-v2; (b) Reacher-v2; (c) Hopper-v2; (d) Walker2d-v2; (e) Ant-v2; (f) Humanoid-v2
  • Figure 2: KL-divergence ${\rm D_{KL}}$ between two policies: (a) $\mu_1=\mu_2=1, \sigma\in[0.01,0.1]$; (b) $\mu_1=1,\mu_2=1.1, \sigma\in[0.01,0.1]$; (c) $\mu_1=1,\mu_2=1.1, \sigma\in[0.01,1]$; (d) $\mu_1=1,\mu_2=2, \sigma\in[0.01,1]$
  • Figure 3: The asymmetry of the KL-divergence $\Delta {\rm D_{KL}}$ between two policies: (a) $\mu_1=\mu_2=1, \sigma\in[0.01,0.1]$; (b) $\mu_1=1,\mu_2=1.1, \sigma\in[0.01,0.1]$; (c) $\mu_1=1,\mu_2=1.1, \sigma\in[0.01,1]$; (d) $\mu_1=1,\mu_2=2, \sigma\in[0.01,1]$
  • Figure 4: Scale comparison between the asymmetry of the KL-divergence and the estimated advantage function: (a) Swimmer-v2; (b) Reacher-v2; (c) Hopper-v2; (d) Walker2d-v2; (e) Ant-v2; (f) Humanoid-v2
  • Figure 5: The training results in the MuJoCo tasks: (a) Swimmer-v2; (b) Reacher-v2; (c) Hopper-v2; (d) Walker2d-v2; (e) Ant-v2; (f) Humanoid-v2
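The Gaussian settings named in the captions of Figures 2 and 3 can be explored numerically with the closed-form KL divergence between two univariate normal distributions, $\mathrm{D_{KL}}(p\,\|\,q)=\ln\frac{\sigma_q}{\sigma_p}+\frac{\sigma_p^2+(\mu_p-\mu_q)^2}{2\sigma_q^2}-\frac{1}{2}$. The sketch below is an illustration, not the paper's code: the function names are hypothetical, the asymmetry is taken here to be the signed difference of the two directions (the paper's exact definition of $\Delta {\rm D_{KL}}$ is not given in this excerpt), and the standard deviations are picked from the captions' ranges purely as examples.

```python
import math


def kl_gauss(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2)) for univariate Gaussians."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)


def kl_asymmetry(mu_p, sigma_p, mu_q, sigma_q):
    """Signed difference KL(p||q) - KL(q||p); this definition is an assumption."""
    return (kl_gauss(mu_p, sigma_p, mu_q, sigma_q)
            - kl_gauss(mu_q, sigma_q, mu_p, sigma_p))


if __name__ == "__main__":
    # Means follow the captions' panel (b); the sigma pairs are illustrative picks.
    for sigma_p, sigma_q in [(0.05, 0.05), (0.01, 0.1), (0.1, 0.01)]:
        forward = kl_gauss(1.0, sigma_p, 1.1, sigma_q)
        gap = kl_asymmetry(1.0, sigma_p, 1.1, sigma_q)
        print(f"sigma_p={sigma_p}, sigma_q={sigma_q}: KL={forward:.4f}, asymmetry={gap:.4f}")
```

With equal standard deviations the two directions coincide and the asymmetry vanishes; it grows quickly once the standard deviations differ by an order of magnitude, which is the regime the captions' $\sigma$ ranges cover.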

Theorems & Definitions (8)

  • Lemma 1
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • proof