Demonstration-Guided Continual Reinforcement Learning in Dynamic Environments

Xue Yang, Michael Schukat, Junlin Lu, Patrick Mannion, Karl Mason, Enda Howley

TL;DR

The paper tackles continual reinforcement learning in dynamic environments by introducing DGCRL, which externalizes prior knowledge as demonstrations that directly guide exploration. It combines a dynamic curriculum, which shifts the agent from demonstration-guided exploration to self-exploration, with a self-evolving repository that continuously improves the guidance. Across 2D navigation and MuJoCo locomotion tasks with non-stationary dynamics, DGCRL achieves superior average performance, forward transfer, and reduced forgetting, surpassing strong baselines and ablated variants. The work highlights the practical potential of demonstration-guided strategies for fast adaptation, while acknowledging limitations in forgetting metrics and the need for scalable demonstration management for real-world deployment.

Abstract

Reinforcement learning (RL) excels in various applications but struggles in dynamic environments where the underlying Markov decision process evolves. Continual reinforcement learning (CRL) enables RL agents to continually learn and adapt to new tasks, but balancing stability (preserving prior knowledge) and plasticity (acquiring new knowledge) remains challenging. Existing methods primarily address the stability-plasticity dilemma through mechanisms where past knowledge influences optimization but rarely affects the agent's behavior directly, which may hinder effective knowledge reuse and efficient learning. In contrast, we propose demonstration-guided continual reinforcement learning (DGCRL), which stores prior knowledge in an external, self-evolving demonstration repository that directly guides RL exploration and adaptation. For each task, the agent dynamically selects the most relevant demonstration and follows a curriculum-based strategy to accelerate learning, gradually shifting from demonstration-guided exploration to full self-exploration. Extensive experiments on 2D navigation and MuJoCo locomotion tasks demonstrate its superior average performance, enhanced knowledge transfer, mitigation of forgetting, and training efficiency. The additional sensitivity analysis and ablation study further validate its effectiveness.
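The selection-plus-curriculum idea described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the relevance score, the fixed-prefix hand-over between demonstration and learner, and the threshold-based schedule are all assumptions chosen for illustration, and every function name below (`select_demonstration`, `mixed_rollout_actions`, `next_guide_steps`) is hypothetical.

```python
from typing import Callable, List, Sequence


def select_demonstration(
    repository: Sequence[Sequence[int]],
    score: Callable[[Sequence[int]], float],
) -> Sequence[int]:
    """Return the demonstration with the highest relevance score.

    ASSUMPTION: relevance is a scalar supplied by the caller (for example,
    the return a demonstration achieves when replayed on the new task).
    The summary does not say how relevance is actually measured.
    """
    return max(repository, key=score)


def mixed_rollout_actions(
    demo_actions: Sequence[int],
    learner_action: Callable[[int], int],
    guide_steps: int,
    horizon: int,
) -> List[int]:
    """Follow the demonstration for the first `guide_steps` steps, then
    hand control to the learner for the rest of the episode.

    ASSUMPTION: a prefix-style hand-over; other mixing schemes are possible.
    """
    actions = []
    for t in range(horizon):
        if t < guide_steps and t < len(demo_actions):
            actions.append(demo_actions[t])   # demonstration-guided step
        else:
            actions.append(learner_action(t))  # self-exploration step
    return actions


def next_guide_steps(
    guide_steps: int,
    recent_return: float,
    threshold: float,
    shrink: int = 1,
) -> int:
    """Shrink the guided prefix once the learner performs well enough.

    ASSUMPTION: a simple threshold rule stands in for the dynamic curriculum.
    """
    if recent_return >= threshold:
        return max(0, guide_steps - shrink)
    return guide_steps
```

Repeatedly calling `next_guide_steps` as returns improve drives `guide_steps` to zero, at which point every action comes from the learner. That mirrors the gradual shift from demonstration-guided exploration to self-exploration.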

Paper Structure

This paper contains 24 sections, 1 theorem, 32 equations, 6 figures, 5 tables, 1 algorithm.

Key Result

Lemma 1

The performance difference between the optimal policy and the current learning policy on the states visited by the optimal policy does not exceed C times the performance difference measured on the states visited by the guide policy.
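
One way to write the lemma symbolically is sketched below. The notation is assumed and does not appear in this summary: $\pi^*$ is the optimal policy, $\pi$ the current learning policy, $\pi^g$ the guide policy, $d^{\pi'}$ the distribution of states visited by a policy $\pi'$, and $V^{\pi'}$ its value function. Taking the "performance difference" to be the value gap $V^{\pi^*} - V^{\pi}$ is also an assumption.

$$
\mathbb{E}_{s \sim d^{\pi^*}}\!\left[ V^{\pi^*}(s) - V^{\pi}(s) \right] \;\le\; C \cdot \mathbb{E}_{s \sim d^{\pi^g}}\!\left[ V^{\pi^*}(s) - V^{\pi}(s) \right]
$$

A bound of this shape would follow, for instance, if the guide policy covered the optimal policy's states with $d^{\pi^*}(s) \le C \, d^{\pi^g}(s)$ for all $s$, because the value gap is non-negative. Whether the paper defines $C$ this way is not stated here.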

Figures (6)

  • Figure 1: In DGCRL, the agent dynamically selects the most relevant demonstration from the self-evolving repository for each new task, and then follows a curriculum-based strategy to guide exploration and facilitate faster learning.
  • Figure 2: 2D Navigation (a) Type 1 (v1): Only the goal position changes. (b) Type 2 (v2): Only the puddle positions change. (c) Type 3 (v3): Both the puddle and goal positions change.
  • Figure 3: MuJoCo locomotion. (a) Hopper, with state dimension $|\mathcal{S}|=11$, action dimension $|\mathcal{A}|=3$, and reward function $r=1-4|v_x-v_g|$. (b) Ant, with $|\mathcal{S}|=111$, $|\mathcal{A}|=8$, and $r=1-3|v_x-v_g|$. (c) HalfCheetah, with $|\mathcal{S}|=20$, $|\mathcal{A}|=6$, and $r=-|v_x-v_g|$.
  • Figure 4: Average return per episode - Main Experiment
  • Figure 5: Average return per episode - Sensitivity Analysis
  • ...and 1 more figure
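
The three reward functions in the Figure 3 caption share the form `offset - scale * |v_x - v_g|`, which the short Python sketch below makes explicit. It reads $v_x$ as the agent's forward velocity and $v_g$ as a target velocity. That reading is an assumption, since the caption does not define the symbols. The names `REWARD_PARAMS` and `reward` are hypothetical.

```python
# (offset, scale) pairs transcribed from the Figure 3 caption:
#   Hopper:      r = 1 - 4|v_x - v_g|
#   Ant:         r = 1 - 3|v_x - v_g|
#   HalfCheetah: r =   -  |v_x - v_g|
REWARD_PARAMS = {
    "hopper": (1.0, 4.0),
    "ant": (1.0, 3.0),
    "halfcheetah": (0.0, 1.0),
}


def reward(env_name: str, v_x: float, v_g: float) -> float:
    """Velocity-tracking reward: highest when v_x equals v_g.

    ASSUMPTION: v_x is the current forward velocity and v_g the target
    velocity; the caption gives only the formulas.
    """
    offset, scale = REWARD_PARAMS[env_name]
    return offset - scale * abs(v_x - v_g)
```

Under this reading, changing $v_g$ between tasks changes the reward while leaving the robot's morphology untouched, which is one way a locomotion benchmark can be made non-stationary.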

Theorems & Definitions (2)

  • Lemma 1
  • proof