Boosting Reinforcement Learning Algorithms in Continuous Robotic Reaching Tasks using Adaptive Potential Functions

Yifei Chen; Lambert Schomaker; Francisco Cruz

TL;DR

This work accelerates reinforcement learning for continuous robotic reaching by combining an adaptive potential function (APF) with Deep Deterministic Policy Gradient (DDPG) to form APF-DDPG. The APF is learned online in a discrete potential-state space, and its shaping reward $F(s, s') = \gamma \phi(s') - \phi(s)$ is added to the environment reward, enabling faster and more robust learning in continuous control. The method is validated on Baxter reaching tasks in simulation and on a real Baxter robot, where it shows significantly higher performance and lower failure rates than vanilla DDPG. The results demonstrate the practical viability of online APF-based reward shaping for real-world robotic RL and point to future improvements in exploration to avoid local optima.
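The shaping term in the summary above can be illustrated with a minimal Python sketch. It only shows how $F(s, s') = \gamma \phi(s') - \phi(s)$ is added to an environment reward. The potential used here, a negative Euclidean distance to a goal, is a hand-written stand-in chosen for illustration; it is not the learned APF described in the paper, and the names `make_distance_potential`, `shaping_reward`, and `shaped_reward` are hypothetical.

```python
import math
from typing import Callable, Sequence

State = Sequence[float]
Potential = Callable[[State], float]


def make_distance_potential(goal: State) -> Potential:
    """Hypothetical stand-in potential: negative Euclidean distance to the goal.

    The paper learns its potential function online; this fixed function is
    only an assumption that makes the example runnable.
    """
    def phi(s: State) -> float:
        return -math.sqrt(sum((a - b) ** 2 for a, b in zip(s, goal)))
    return phi


def shaping_reward(phi: Potential, s: State, s_next: State, gamma: float) -> float:
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi(s_next) - phi(s)


def shaped_reward(env_reward: float, phi: Potential, s: State,
                  s_next: State, gamma: float) -> float:
    """Environment reward plus the shaping term."""
    return env_reward + shaping_reward(phi, s, s_next, gamma)


if __name__ == "__main__":
    phi = make_distance_potential(goal=[0.0, 0.0, 0.0])
    s, s_next = [3.0, 4.0, 0.0], [0.0, 0.0, 0.0]
    # Moving from distance 5 to distance 0 yields a positive shaping term.
    print(shaping_reward(phi, s, s_next, gamma=1.0))
    print(shaped_reward(-1.0, phi, s, s_next, gamma=1.0))
```

In an off-policy learner such as DDPG, the shaped value would typically replace the raw environment reward in the stored transition; where exactly APF-DDPG applies it is described in the paper's methodology section, not in this sketch.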

Abstract

In reinforcement learning, reward shaping is an efficient way to guide the learning process of an agent, as the reward can indicate the optimal policy of the task. The potential-based reward shaping framework was proposed to guarantee policy invariance after reward shaping, where a potential function is used to calculate the shaping reward. In previous work, we proposed a novel adaptive potential function (APF) method that learns the potential function concurrently with training the agent, based on information collected by the agent during training, and examined the APF method in discrete action space scenarios. This paper investigates the feasibility of using APF to solve continuous reaching tasks in a real-world robotic scenario with a continuous action space. We combine the Deep Deterministic Policy Gradient (DDPG) algorithm with our proposed method to form a new algorithm called APF-DDPG. To compare APF-DDPG with DDPG, we designed a task in which the agent learns to control Baxter's right arm to reach a goal position. The experimental results show that the APF-DDPG algorithm outperforms the DDPG algorithm in both learning speed and robustness.

Paper Structure (16 sections, 8 equations, 6 figures)

Introduction
Related Works
Background
Reinforcement Learning
Potential-based Reward Shaping
Deep Deterministic Policy Gradient
MDP in the robotic scenario
Methodology
Discrete Potential States
Adaptive Potential Function
APF-DDPG
Experiments
Experimental Parameters
Experimental Results
Experiments on A Real Baxter
...and 1 more section

Figures (6)

Figure 1: A visualization of the experimental environments with a Baxter robot in CoppeliaSim (first row) and the real world (second row). The task is to control Baxter's right arm's joints so that its right tip can reach the goal area, shown as a gray cube in the upper left corner of the first-row figures. The left-column figures show the initial state of Baxter, and the right-column figures give an example of a successful state in CoppeliaSim and the real world.
Figure 2: An image of Baxter's right arm. Each joint is sequentially numbered from $1$ to $7$ from shoulder to wrist. The tip is at the end of the arm.
Figure 3: A schematic of APF-DDPG. The APF network is trained to output a potential value for each state. Then, the environmental reward is shaped based on the PBRS framework, and the shaped reward is collected to train the underlying DDPG networks. After training, the actor network can be used to control the Baxter to reach the goal in both the simulator and the real world.
Figure 4: Comparison of performances of the DDPG agent (the blue curve) and the APF-DDPG agent (the orange curve). Each curve is averaged over 20 experimental runs and each run is smoothed by an average window of 100 episodes. The shaded region represents the standard deviation range.
Figure 5: Comparison among the averaged cumulative reward over the last $100$ episodes in each of the 20 experiments for the DDPG agent and the APF-DDPG agent. Results are shown in ascending order for each agent. Blue dots represent DDPG, while orange dots represent APF-DDPG.
...and 1 more figure
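The curve construction described in the Figure 4 caption (each run smoothed with a 100-episode averaging window, then averaged across runs with a standard-deviation band) can be sketched as follows. This is a minimal illustration, not the authors' plotting code. The trailing window, the shorter averages over the first episodes, and the use of the population standard deviation are all assumptions, and the function names `moving_average` and `aggregate_runs` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def moving_average(values: Sequence[float], window: int = 100) -> List[float]:
    """Trailing moving average; early entries average over fewer episodes (assumption)."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def aggregate_runs(runs: Sequence[Sequence[float]],
                   window: int = 100) -> Tuple[List[float], List[float]]:
    """Smooth each run, then return the per-episode mean and standard deviation."""
    smoothed = [moving_average(run, window) for run in runs]
    n_episodes = min(len(run) for run in smoothed)
    means = [mean([run[t] for run in smoothed]) for t in range(n_episodes)]
    # Population standard deviation is an assumption; the caption does not specify.
    stds = [pstdev([run[t] for run in smoothed]) for t in range(n_episodes)]
    return means, stds


if __name__ == "__main__":
    # Two tiny made-up runs, used only to exercise the functions.
    means, stds = aggregate_runs([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]], window=2)
    print(means, stds)
```

The shaded region in such a plot would then span `means[t] - stds[t]` to `means[t] + stds[t]` at each episode `t`.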
