Adaptive trajectory-constrained exploration strategy for deep reinforcement learning

Guojian Wang; Faguo Wu; Xiao Zhang; Ning Guo; Zhiming Zheng

TL;DR

This work tackles the hard-exploration challenge in deep reinforcement learning under sparse and deceptive rewards by introducing Trajectory-Constrained Exploration (TACE). TACE uses incomplete, suboptimal offline demonstrations as references and drives exploration through a maximum mean discrepancy (MMD) based distance constraint between current trajectories and the offline data. The problem is cast as constrained policy optimization, and an unconstrained reformulation enables stable, principled updates. Adaptive constraint boundary normalization and an adaptive scaling mechanism balance exploration and exploitation. The paper presents three algorithms (TCPPO, TCHRL, and TCMAE) for non-hierarchical, hierarchical, and multi-agent settings. Theoretical analysis yields worst-case bounds on return improvement under the MMD constraints, and empirical results on large gridworlds and MuJoCo mazes show improved temporally extended exploration, avoidance of suboptimal myopic policies, and strong performance in single- and multi-agent tasks. The work provides open-source code and demonstrates a practical, scalable approach to guiding exploration without heavy reliance on additional neural architectures for novelty estimation or on perfect demonstrations.
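To make the MMD-based distance concrete, the following is a minimal sketch of a squared-MMD estimate between two sets of state-action vectors: one standing in for the current trajectories and one for the offline replay memory. The Gaussian kernel, the `bandwidth` value, the biased estimator, and all function names are assumptions made for illustration; the summary does not specify which kernel or estimator the authors use.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two vectors (kernel choice is an assumption)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth=1.0):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )

def sample_pairs(rng, center, n=32, dim=4):
    """Hypothetical stand-in for state-action vectors drawn around `center`."""
    return [tuple(rng.gauss(center, 1.0) for _ in range(dim)) for _ in range(n)]

rng = random.Random(0)
current = sample_pairs(rng, 0.0)      # pairs visited by the current policy
memory_near = sample_pairs(rng, 0.0)  # offline data covering the same region
memory_far = sample_pairs(rng, 3.0)   # offline data from a different region

near = mmd_squared(current, memory_near)
far = mmd_squared(current, memory_far)
print(f"near={near:.4f} far={far:.4f}")
```

A larger value means the current samples lie farther from the stored demonstrations, which is the quantity a distance constraint of this kind would push upward when the demonstrations are suboptimal.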

Abstract

Deep reinforcement learning (DRL) faces significant challenges in addressing hard-exploration problems in tasks with sparse or deceptive rewards and large state spaces. These challenges severely limit the practical application of DRL. Most previous exploration methods relied on complex architectures to estimate state novelty or introduced sensitive hyperparameters, resulting in instability. To mitigate these issues, we propose an efficient adaptive trajectory-constrained exploration strategy for DRL. The proposed method guides the agent's policy away from suboptimal solutions by leveraging incomplete offline demonstrations as references. This approach gradually expands the agent's exploration scope and strives for optimality through constrained optimization. Additionally, we introduce a novel policy-gradient-based optimization algorithm that uses adaptively clipped trajectory-distance rewards for both single- and multi-agent reinforcement learning. We provide a theoretical analysis of our method, including a derivation of worst-case approximation error bounds, which supports the validity of our approach for enhancing exploration. To evaluate the effectiveness of the proposed method, we conducted experiments on two large 2D grid world mazes and several MuJoCo tasks. The extensive experimental results demonstrate the significant advantages of our method in achieving temporally extended exploration and avoiding myopic and suboptimal behaviors in both single- and multi-agent settings. The reported quantitative metrics further support these findings. The code used in the study is available at https://github.com/buaawgj/TACE.

Paper Structure (45 sections, 11 theorems, 61 equations, 17 figures, 2 tables, 3 algorithms)

Introduction
Related work
Preliminaries
Reinforcement Learning
Multi-Agent Reinforcement Learning
Maximum Mean Discrepancy
Skill-Based HRL Algorithms
Proposed Approach
Trajectory-Constrained Exploration Strategy
Fast Adaptation Methods
Adaptive Constraint Boundary Adjustment Method
Adaptive Scaling Method
Theoretical Analysis of TACE Performance Bounds
Experimental Setup
Environments
...and 30 more sections

Key Result

Lemma 1

Let $\rho_{\pi}(s,a)$ be the state-action visitation distribution induced by the current policy $\pi$. Let $D(x,\mathcal{M})$ be the MMD distance between the state-action pair $x$ and the replay memory $\mathcal{M}$. Then, if the policy $\pi$ is parameterized by $\theta$, the gradient of the ${\rm MMD}$ term with respect to $\theta$ is [equation missing in source].

Figures (17)

Figure 1: In a two-layer structure, the high-level policy $\pi_{\theta_h}(z_{t} \vert s_{t})$ samples a latent code $z_t$ and decides which low-level policy acts over the next $p$ time steps. The low-level policy $\pi_{\theta_l}(a_t \vert s_t, z_{t})$ outputs actions $a_t$ that interact directly with the external environment. Note that $z_{t} = z_{kp}$ for $t = kp$ to $(k+1)p - 1$. After $p$ time steps, the high-level policy selects a new high-level action.
Figure 2: (a) Gridworld; (b) Deceptive Reacher.
Figure 3: (a) Maze 0 with two different goals. The agent is rewarded 60 for reaching the red goal and 50 for reaching the blue goal. (b) Maze 1 with three different goals. The agent is rewarded 90 for reaching the red goal, 60 for reaching the green goal, and 30 for reaching the blue goal.
Figure 4: (a) Discrete multiple-particle environment; (b) SparseAnt Maze.
Figure 5: Learning curves of average return and success rate for different methods in Maze 0 when the replay memory stores previous suboptimal trajectories leading to the locally optimal goal. The success rate shows how often the agent reaches the globally optimal goal during training.
...and 12 more figures
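
The timing convention in the Figure 1 caption, where the latent code satisfies $z_t = z_{kp}$ for $t = kp$ to $(k+1)p - 1$, can be sketched as follows. This is only an illustration of the hold-for-$p$-steps schedule; `rollout_latents` and `sample_latent` are hypothetical names, and the high-level policy is replaced by a placeholder function that ignores the state.

```python
def rollout_latents(sample_latent, p, horizon):
    """Return the latent code active at each time step t = 0..horizon-1.

    `sample_latent(t)` stands in for drawing z_t from the high-level policy;
    it is called only when t is a multiple of p, and the result is held
    fixed for the following p - 1 steps.
    """
    latents = []
    z = None
    for t in range(horizon):
        if t % p == 0:
            z = sample_latent(t)  # high-level policy acts every p steps
        latents.append(z)         # low-level policy would condition on this z
    return latents

# Placeholder "policy" that returns the step index, to make the schedule visible.
print(rollout_latents(lambda t: t, p=3, horizon=7))  # prints [0, 0, 0, 3, 3, 3, 6]
```

In a full agent, the low-level policy would be queried at every step with the current state and the held latent code, while the high-level policy is queried only at the resampling steps.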

Theorems & Definitions (25)

Remark 1
Remark 2
Lemma 1: Gradient Derivation of the MMD term
Remark 3
Remark 4
Theorem 1
Corollary 1
Corollary 2
Definition 1
Definition 2
...and 15 more
