
Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations

Guojian Wang, Faguo Wu, Xiao Zhang, Tianyuan Chen

TL;DR

This work proposes a simple and efficient algorithm called Policy Optimization with Smooth Guidance (POSG), which leverages a small set of state-only demonstrations to indirectly make approximate and feasible long-term credit assignments and facilitate exploration.

Abstract

The sparsity of reward feedback remains a challenging problem in online deep reinforcement learning (DRL). Previous approaches have used offline demonstrations to achieve impressive results on multiple hard tasks. However, these approaches place high demands on demonstration quality, and obtaining expert-like actions is often costly and unrealistic. To tackle these problems, we propose a simple and efficient algorithm called Policy Optimization with Smooth Guidance (POSG), which leverages a small set of state-only demonstrations (demonstrations that do not include expert actions) to indirectly make approximate and feasible long-term credit assignments and facilitate exploration. Specifically, we first design a trajectory-importance evaluation mechanism to determine the quality of the current trajectory against demonstrations. Then, we introduce a guidance reward computation technique based on trajectory importance to measure the impact of each state-action pair, fusing the demonstrator's state distribution with reward information into the guidance reward. We theoretically analyze the performance improvement caused by smooth guidance rewards and derive a new worst-case lower bound on the performance improvement. Extensive results demonstrate POSG's significant advantages in control performance and convergence speed in four sparse-reward environments: the grid-world maze, Hopper-v4, HalfCheetah-v4, and Ant maze. We report specific metrics and quantitative results to demonstrate the superiority of POSG.
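
The abstract describes scoring a trajectory against state-only demonstrations and then turning that score into a per-step guidance reward, but it gives no formulas. The sketch below is only an illustration of that idea under stated assumptions: it uses a Gaussian-kernel maximum mean discrepancy (MMD) between visited states and demonstration states as the trajectory-level score (the figure list mentions an MMD distance, but its exact role in POSG is not stated here), and it spreads the score evenly over the steps. The function names, the kernel choice, the `bandwidth` parameter, and the even split are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence

State = Sequence[float]


def gaussian_kernel(x: State, y: State, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two state vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs: Sequence[State], ys: Sequence[State], bandwidth: float) -> float:
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def squared_mmd(traj_states: Sequence[State],
                demo_states: Sequence[State],
                bandwidth: float = 1.0) -> float:
    """Biased (V-statistic) estimate of the squared MMD between two state sets."""
    return (mean_kernel(traj_states, traj_states, bandwidth)
            + mean_kernel(demo_states, demo_states, bandwidth)
            - 2.0 * mean_kernel(traj_states, demo_states, bandwidth))


def even_guidance_rewards(traj_states: Sequence[State],
                          demo_states: Sequence[State],
                          bandwidth: float = 1.0) -> List[float]:
    """ASSUMPTION: one trajectory-level score, split evenly across all steps.

    A trajectory whose states lie closer to the demonstrations gets a higher
    (less negative) reward at every step. This is not the paper's formula.
    """
    score = -squared_mmd(traj_states, demo_states, bandwidth)
    return [score / len(traj_states)] * len(traj_states)


if __name__ == "__main__":
    demos = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    near = [(0.1, 0.0), (1.1, 0.0), (2.1, 0.0)]
    far = [(5.0, 5.0), (6.0, 5.0), (7.0, 5.0)]
    print(squared_mmd(near, demos), squared_mmd(far, demos))
```

Only state vectors are needed, which matches the state-only setting: no expert actions enter the computation. Assigning the same value to every step is one simple way to pass a trajectory-level signal to each state-action pair without estimating per-step credit.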

Paper Structure (34 sections, 9 theorems, 46 equations, 10 figures, 2 algorithms)

Key Result

Lemma 1

Suppose $\pi_b$ is a policy implied by the replay memory $\mathcal{M}_E$ that contains all optimal trajectories, and $p(s^\prime|s) = \pi_b(a|s)P(s^\prime|s, a)$ is the state transition function consistent with $\mathcal{M}_E$. Let the discount factor $\lambda$ be 1. According to the definition of t Then, $\pi$ is the optimal policy with the highest entropy under the smooth guidance reward $r_i(\c

Figures (10)

  • Figure 1: A collection of environments that we used to evaluate POSG: (a) Key-Door-Treasure domain; (b) SparseHalfCheetah; (c) SparseHopper; (d) Ant Maze.
  • Figure 2: (a) Success rate in the Key-Door-Treasure domain; (b) The changing trend of the MMD distance.
  • Figure 3: (a) State-action visitation graph of demonstrations; (b) State-action visitation graph of the POSG learned policy.
  • Figure 4: The state-action visitation graphs of all algorithms: (a) PPO; (b) SIL; (c) PPO+D; (d) POSG.
  • Figure 5: (a) Learning curves of average return on the SparseHalfCheetah task; (b) Learning curves of average return on the SparseHopper task.
  • ...and 5 more figures

Theorems & Definitions (17)

  • Remark 1
  • Lemma 1
  • Remark 2
  • Remark 3
  • Theorem 1
  • Remark 4
  • Corollary 1
  • Remark 5
  • Lemma 1
  • proof
  • ...and 7 more