Where-to-Learn: Analytical Policy Gradient Directed Exploration for On-Policy Robotic Reinforcement Learning

Leixin Chang, Xinchen Yao, Ben Liu, Liangjing Yang, Hua Chen

Abstract

On-policy reinforcement learning (RL) algorithms have demonstrated great potential in robotic control, where effective exploration is crucial for efficient and high-quality policy learning. However, efficiently steering the agent towards better trajectories remains a challenge. Most existing methods incentivize exploration by maximizing the policy entropy or by rewarding visits to novel states, regardless of the potential value of those states. We propose a new form of directed exploration that uses analytical policy gradients from a differentiable dynamics model to inject task-aware, physics-guided guidance, thereby steering the agent towards high-reward regions for accelerated and more effective policy learning.
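
To make the idea concrete, below is a minimal sketch of such a directed-exploration loop, assuming a PyTorch-style differentiable toy dynamics model in place of the differentiable simulator. All names here (PolicyNet, toy_dynamics, rollout_return, ppo_update) are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of APG-directed exploration (illustrative only).
# Assumptions: a toy differentiable dynamics model stands in for the
# differentiable simulator; the PPO data collection/update is left as a stub.
import copy
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Tiny deterministic policy for a 1-D double integrator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))

    def forward(self, state):
        return self.net(state)

def toy_dynamics(state, action):
    """Differentiable toy dynamics: state = [position, velocity]."""
    pos, vel = state[..., :1], state[..., 1:]
    vel = vel + 0.1 * action
    pos = pos + 0.1 * vel
    return torch.cat([pos, vel], dim=-1)

def rollout_return(policy, horizon=50, gamma=0.99):
    """Analytically differentiable discounted return (reward = -position^2)."""
    state = torch.tensor([[1.0, 0.0]])
    ret = torch.zeros(())
    for t in range(horizon):
        action = policy(state)
        state = toy_dynamics(state, action)
        ret = ret - (gamma ** t) * state[0, 0] ** 2
    return ret

policy = PolicyNet()  # primary on-policy actor
for k in range(10):   # outer RL iterations
    # 1) One analytical policy-gradient ascent step on a copy of the policy
    #    yields an exploratory policy pointed towards higher-reward regions.
    explore_policy = copy.deepcopy(policy)
    opt = torch.optim.SGD(explore_policy.parameters(), lr=1e-2)
    opt.zero_grad()
    (-rollout_return(explore_policy)).backward()  # ascend J by descending -J
    opt.step()
    # 2) Collect data with explore_policy, update `policy` with the usual
    #    on-policy objective (e.g. PPO), then discard explore_policy.
    #    ppo_update(policy, collect_rollouts(explore_policy))  # placeholder
```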

Paper Structure

This paper contains 20 sections, 2 theorems, 16 equations, 10 figures, 3 tables, 1 algorithm.

Key Result

Lemma 4.1

Let the performance objective be $J(\boldsymbol{\theta})=\mathbb{E}_{\tau \sim \pi_{\boldsymbol{\theta}}}\left[ \sum_{t}\gamma^t r_t\right]$. Assume $J(\boldsymbol{\theta})$ is $L$-smooth and that the exploratory parameters are produced by a single analytical policy-gradient ascent step $\boldsymbol{\theta}_k^{\text{explore}}=\boldsymbol{\theta}_k+\alpha\,\nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta}_k)$ with step size $0<\alpha<2/L$. Then $J(\boldsymbol{\theta}_k^{\text{explore}})> J(\boldsymbol{\theta}_k)$ whenever $\nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta}_k)\neq 0$.
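
A standard way to see why such a result holds (a sketch under the stated assumptions, not the paper's proof): $L$-smoothness gives the quadratic lower bound

$$J\big(\boldsymbol{\theta}_k + \alpha\nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta}_k)\big) \;\ge\; J(\boldsymbol{\theta}_k) + \alpha\Big(1-\tfrac{L\alpha}{2}\Big)\big\|\nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta}_k)\big\|^2,$$

and the coefficient $\alpha\,(1-L\alpha/2)$ is strictly positive for $0<\alpha<2/L$, so the right-hand side strictly exceeds $J(\boldsymbol{\theta}_k)$ whenever $\nabla_{\boldsymbol{\theta}}J(\boldsymbol{\theta}_k)\neq 0$.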

Figures (10)

  • Figure 1: Illustration of the proposed directed exploration.
  • Figure 2: Method Overview.
  • Figure 3: Policy update iteration of the proposed method. In every iteration, the exploratory policy $\pi_k^{\text{explore}}$ is discarded after exploratory data collection.
  • Figure 4: Training curves on 8 benchmark tasks comparing the proposed method against the PPO baseline and PPO with the RND exploration bonus. The SHAC curves serve as a reference. Solid lines and shaded regions depict the mean and standard deviation across five trials, respectively. Across most benchmark tasks, our method matches or exceeds the asymptotic performance of the PPO baseline and PPO with RND, with improved sample efficiency and better training stability.
  • Figure 5: Advantage difference between the data collected by the exploratory policy and the data collected by the primary policy in the CartpoleBalance task. Data were collected over five trials and smoothed with a factor of 0.7.
  • ...and 5 more figures

Theorems & Definitions (4)

  • Lemma 4.1: Local Policy Improvement from an APG Update
  • Proof of Lemma 4.1
  • Theorem 4.2: Superiority of the Exploratory Policy in Expected Advantage
  • Proof of Theorem 4.2