Optimal Control-Based Baseline for Guided Exploration in Policy Gradient Methods

Xubo Lyu, Site Li, Seth Siriya, Ye Pu, Mo Chen

TL;DR

A novel optimal control-based baseline function is presented for the policy gradient method in deep reinforcement learning (RL) by computing the value function of an optimal control problem, which is formed to be closely associated with the RL task.

Abstract

In this paper, a novel optimal control-based baseline function is presented for the policy gradient method in deep reinforcement learning (RL). The baseline is obtained by computing the value function of an optimal control problem, which is formulated to be closely associated with the RL task. In contrast to traditional baselines, which aim to reduce the variance of policy gradient estimates, our work uses the optimal control value function to give the baseline a new role -- providing guided exploration during policy learning. This role has received little attention in prior work. We validate our baseline on robot learning tasks, showing its effectiveness in guided exploration, particularly in sparse reward environments.
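
To make the role of a baseline concrete, the sketch below shows the standard way a state-dependent baseline enters a policy gradient estimate: the advantage at each step is the discounted return minus the baseline evaluated at that state. It is only an illustration under stated assumptions, not the authors' implementation. The function names (`discounted_returns`, `advantages_with_baseline`, `toy_oc_value`) are hypothetical, and `toy_oc_value` is a made-up stand-in (negative distance to a goal at the origin) for an optimal control value function, which the paper would instead compute from the associated optimal control problem.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, float]  # assumed 2D position, for illustration only


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages_with_baseline(
    states: Sequence[State],
    rewards: Sequence[float],
    baseline: Callable[[State], float],
    gamma: float = 0.99,
) -> List[float]:
    """Advantage estimate A_t = G_t - b(s_t) for a state-dependent baseline b."""
    returns = discounted_returns(rewards, gamma)
    return [g - baseline(s) for g, s in zip(returns, states)]


def toy_oc_value(state: State) -> float:
    """Hypothetical stand-in for an optimal control value function:
    the negative Euclidean distance to a goal at the origin."""
    x, y = state
    return -((x * x + y * y) ** 0.5)


# A short sparse-reward trajectory: reward arrives only at the final step.
states = [(3.0, 4.0), (0.0, 2.0), (0.0, 0.0)]
rewards = [0.0, 0.0, 1.0]
adv = advantages_with_baseline(states, rewards, toy_oc_value, gamma=1.0)
```

With a sparse reward, the returns alone carry the same signal at every step of this trajectory, while the baseline term makes the advantage depend on where each state lies relative to the goal. How the actual optimal control value function is defined and scaled is specified in the paper, not here.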

Paper Structure

This paper contains 17 sections, 9 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Overview of our method. Left: We propose a novel baseline function for policy gradient RL. Our method involves extracting RL info from the robot and environment, which is used to formulate an associated optimal control problem. Subsequently, we compute the optimal control value function and use it as a baseline for policy gradient RL. Right: The RL info encompasses crucial aspects of the RL problem, including robot types, RL full state, task reward, and more. This RL info serves to form the key components of an optimal control problem, which include the robot system model, objectives, and constraints. Techniques like Model Predictive Control (MPC) can then be employed to compute the value function, which can be used as an RL baseline.
  • Figure 2: Car navigation environment
  • Figure 3: The car navigation reward performance
  • Figure 4: The optimal control value heatmap for car navigation
  • Figure 5: Policy advantage estimation w/ and w/o OC baseline.
  • ...and 3 more figures