Trajectory-Oriented Policy Optimization with Sparse Rewards

Guojian Wang; Faguo Wu; Xiao Zhang

TL;DR

The paper addresses the challenge of learning under sparse rewards in deep reinforcement learning by using offline demonstration trajectories as soft guidance. It introduces Trajectory-Oriented Policy Optimization (TOPO), which uses a maximum mean discrepancy (MMD)-based trajectory distance to align the agent's state-action visitation distribution with the demonstrations and reformulates policy optimization as a distance-constrained objective. This constrained problem is converted into a policy-gradient algorithm with an intrinsic distance reward $r^{(i)}(s,a)$, enabling efficient exploration without dense reward design or perfect demonstrations. Empirical results on discrete tasks such as Key-Door-Treasure and continuous tasks such as SparseHalfCheetah and SparseHopper show that TOPO outperforms baseline methods in exploration speed and final policy quality in both discrete and continuous sparse-reward domains.

Abstract

Mastering deep reinforcement learning (DRL) proves challenging in tasks featuring scant rewards. These limited rewards merely signify whether the task is partially or entirely accomplished, necessitating various exploration actions before the agent garners meaningful feedback. Consequently, the majority of existing DRL exploration algorithms struggle to acquire practical policies within a reasonable timeframe. To address this challenge, we introduce an approach leveraging offline demonstration trajectories for swifter and more efficient online RL in environments with sparse rewards. Our pivotal insight involves treating offline demonstration trajectories as guidance, rather than mere imitation, allowing our method to learn a policy whose distribution of state-action visitation marginally matches that of offline demonstrations. We specifically introduce a novel trajectory distance relying on maximum mean discrepancy (MMD) and cast policy optimization as a distance-constrained optimization problem. We then illustrate that this optimization problem can be streamlined into a policy-gradient algorithm, integrating rewards shaped by insights from offline demonstrations. The proposed algorithm undergoes evaluation across extensive discrete and continuous control tasks with sparse and misleading rewards. The experimental findings demonstrate the significant superiority of our proposed algorithm over baseline methods concerning diverse exploration and the acquisition of an optimal policy.
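The MMD-based distance mentioned in the abstract can be illustrated with a minimal sketch. It computes the standard biased (V-statistic) estimate of squared MMD between two sets of state-action samples. The Gaussian kernel, the `bandwidth` value, the function names, and the synthetic 4-dimensional samples are all assumptions made for illustration; the paper's own kernel and trajectory distance may differ.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd_squared(agent_sa, demo_sa, bandwidth=1.0):
    """Biased estimate of squared MMD between two sample sets.

    MMD^2 = E[k(a, a')] + E[k(d, d')] - 2 E[k(a, d)]
    """
    k_aa = gaussian_kernel(agent_sa, agent_sa, bandwidth)
    k_dd = gaussian_kernel(demo_sa, demo_sa, bandwidth)
    k_ad = gaussian_kernel(agent_sa, demo_sa, bandwidth)
    return k_aa.mean() + k_dd.mean() - 2.0 * k_ad.mean()


# Synthetic stand-ins for concatenated (state, action) vectors (assumption).
rng = np.random.default_rng(0)
demo = rng.normal(0.0, 1.0, size=(200, 4))   # "demonstration" samples
near = rng.normal(0.0, 1.0, size=(200, 4))   # agent visiting similar regions
far = rng.normal(3.0, 1.0, size=(200, 4))    # agent visiting distant regions

mmd_near = mmd_squared(near, demo)
mmd_far = mmd_squared(far, demo)
```

In this sketch, samples drawn from the same distribution as the demonstrations give a small estimate, while samples from a shifted distribution give a clearly larger one. This is the property that lets such a distance act as guidance toward the demonstrated state-action regions.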

Paper Structure (12 sections, 1 theorem, 17 equations, 3 figures, 1 algorithm)

Introduction
Preliminaries
Reinforcement Learning
Maximum Mean Discrepancy
Proposed Approach
Trajectory-Guided Exploration Strategy
Practical Algorithms
Experiments
Experimental Settings
Results in the Key-Door-Treasure domain
Comparisons on locomotion control tasks
Conclusions

Key Result

Lemma 1

Let $\rho_{\pi}(s, a)$ represent the state-action marginal visitation distribution induced by the current behavior policy $\pi_\theta$. Let $D(x,\mathcal{M})$ denote the MMD distance from the current state-action pair $x$ to the offline demonstration buffer $\mathcal{M}$. Then, the g… (statement truncated; its equations were not recovered)
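One way to read a distance $D(x,\mathcal{M})$ from a single state-action pair to a demonstration buffer is as the squared MMD between a point mass at $x$ and the empirical distribution of the buffer. The sketch below follows that reading. It is an assumption, not the paper's definition: the Gaussian kernel, the names `point_to_buffer_mmd` and `intrinsic_reward`, and the choice of the negative distance as the reward are all hypothetical.

```python
import numpy as np


def rbf(x, y, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of x and the rows of y."""
    diff = x[:, None, :] - y[None, :, :]
    return np.exp(-np.sum(diff**2, axis=-1) / (2.0 * bandwidth**2))


def point_to_buffer_mmd(x, buffer, bandwidth=1.0):
    """Squared MMD between a point mass at x and the empirical buffer distribution."""
    x = np.atleast_2d(x)
    k_xx = 1.0  # a Gaussian kernel evaluated at (x, x) is always 1
    k_xb = rbf(x, buffer, bandwidth).mean()
    k_bb = rbf(buffer, buffer, bandwidth).mean()
    return k_xx - 2.0 * k_xb + k_bb


def intrinsic_reward(x, buffer, bandwidth=1.0):
    """Hypothetical shaping term: pairs closer to the buffer score higher."""
    return -point_to_buffer_mmd(x, buffer, bandwidth)


# Synthetic demonstration buffer of (state, action) vectors (assumption).
rng = np.random.default_rng(0)
demo_buffer = rng.normal(0.0, 1.0, size=(100, 4))

reward_close = intrinsic_reward(np.zeros(4), demo_buffer)
reward_far = intrinsic_reward(np.full(4, 5.0), demo_buffer)
```

Under these assumptions, a pair near the bulk of the buffer receives a higher intrinsic reward than a pair far from it, so adding such a term to a sparse environment reward gives the policy gradient a signal before any task reward is observed.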

Figures (3)

Figure 1: (a) Key-Door-Treasure domain; (b) SparseHalfCheetah; (c) SparseHopper.
Figure 2: Evaluation of TOPO in the discrete Key-Door-Treasure task: (a) learning curves of the success rate; (b) trend of the MMD distance between the current policy and the demonstrations.
Figure 3: Evaluation of TOPO on the SparseHalfCheetah and SparseHopper tasks.

Theorems & Definitions (3)

Remark 1
Lemma 1
proof
