Enhancing Online Reinforcement Learning with Meta-Learned Objective from Offline Data

Shilong Deng, Zetao Zheng, Hongcai He, Paul Weng, Jie Shao

TL;DR

GILD addresses sparse-reward RL by meta-learning an imitation-learning (IL) objective from offline demonstrations. The approach uses bi-level optimization: at the upper level, a meta-learner with parameters $\omega$ shapes the IL objective that guides the RL policy $\phi$; at the lower level, the RL policy is trained on both $\mathcal{L}^{\mathrm{RL}}$ and $\mathcal{L}^{\mathrm{GILD}}$. Across four MuJoCo tasks and three off-policy algorithms, RL+GILD outperforms RL+IL and other baselines, reaching near-optimal performance with fast convergence and modest overhead thanks to a short warm-start phase. This work broadens the utility of imitation information in RL and provides a general, algorithm-agnostic module for improving online learning from sub-optimal offline data.

Abstract

A major challenge in Reinforcement Learning (RL) is the difficulty of learning an optimal policy from sparse rewards. Prior works enhance online RL with conventional Imitation Learning (IL) via a handcrafted auxiliary objective, at the cost of restricting the RL policy to be sub-optimal when the offline data is generated by a non-expert policy. Instead, to better leverage valuable information in offline data, we develop Generalized Imitation Learning from Demonstration (GILD), which meta-learns an objective that distills knowledge from offline data and instills intrinsic motivation towards the optimal policy. Distinct from prior works that are exclusive to a specific RL algorithm, GILD is a flexible module intended for diverse vanilla off-policy RL algorithms. In addition, GILD introduces no domain-specific hyperparameter and minimal increase in computational cost. In four challenging MuJoCo tasks with sparse rewards, we show that three RL algorithms enhanced with GILD significantly outperform state-of-the-art methods.
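
To make the bi-level idea concrete, here is a minimal one-dimensional sketch; it is not the authors' implementation. Everything in it is an assumption made for illustration: the policy is a single scalar `phi`, the RL loss is a dense quadratic stand-in `(phi - A_STAR)^2`, the learned objective is reduced to a single weight `omega` on a behaviour-cloning term `(phi - A_DEMO)^2`, and the upper-level objective is taken to be the RL loss after one look-ahead policy step. The real method learns a far richer objective and works with sparse rewards, which this toy does not model.

```python
# Toy 1-D illustration of a bi-level update; NOT the authors' implementation.
A_STAR = 1.0   # hypothetical optimal action (unknown to the imitation term)
A_DEMO = 0.6   # hypothetical sub-optimal demonstration action
ALPHA = 0.1    # lower-level (policy) step size, assumed
BETA = 1.0     # upper-level (meta) step size, assumed


def rl_grad(phi):
    """d/dphi of the stand-in RL loss (phi - A_STAR)^2."""
    return 2.0 * (phi - A_STAR)


def imitation_grad(phi, omega):
    """d/dphi of the stand-in learned loss omega * (phi - A_DEMO)^2."""
    return 2.0 * omega * (phi - A_DEMO)


def train(steps, meta_learn, phi=0.0, omega=1.0):
    for _ in range(steps):
        # Lower level: one policy step on L_RL + L_omega.
        phi_next = phi - ALPHA * (rl_grad(phi) + imitation_grad(phi, omega))
        if meta_learn:
            # Upper level: chain rule through the policy step,
            # dL_RL(phi_next)/domega = L_RL'(phi_next) * dphi_next/domega.
            dphi_next_domega = -ALPHA * 2.0 * (phi - A_DEMO)
            omega -= BETA * rl_grad(phi_next) * dphi_next_domega
        phi = phi_next
    return phi, omega


phi_fixed, _ = train(500, meta_learn=False)
phi_meta, omega_meta = train(500, meta_learn=True)
print(f"fixed imitation weight: phi = {phi_fixed:.3f}")
print(f"meta-learned weight:    phi = {phi_meta:.3f}, omega = {omega_meta:.3f}")
```

With a fixed imitation weight, the toy policy settles at a compromise between the demonstration and the optimum (0.8 with these constants), mirroring the sub-optimality that the abstract attributes to handcrafted IL objectives. With the meta-update enabled, `omega` first grows while imitation helps, then decays toward zero once the policy passes the demonstration, and `phi` approaches `A_STAR`.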

Paper Structure

This paper contains 19 sections, 11 equations, 12 figures, 9 tables, and 1 algorithm.

Figures (12)

  • Figure 1: Illustration of RL+IL with sparse rewards. Conventional IL guides RL to obtain reward signals in the early stage (left), but restricts the RL policy to be sub-optimal in the later stage (right).
  • Figure 2: Workflow of the bi-level optimization framework, with meta-optimization of GILD at the upper level and meta-training of RL at the lower level supported by $\mathcal{L}^{\mathrm{GILD}}_{\omega}$.
  • Figure 3: Learning curves with mean and standard deviation (left) and average normalized scores (right) in the MuJoCo tasks with sparse rewards. Scores are normalized by the maximum average return of Expert, which corresponds to a score of 100.
  • Figure 4: (i) Left: Visualization of evaluation trajectories and corresponding policy optimization paths for DDPG, DDPG+IL, DDPG+GILD in Point2D Navigation. The red star denotes the goal to reach, as well as parameters for the final policy. (ii) Right: KL divergence and loss analysis for SAC+IL and SAC+GILD.
  • Figure 5: Ablation on warm-start steps. GILD converges within $1\%$ of total steps.
  • ...and 7 more figures
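
The score normalization mentioned in the Figure 3 caption can be sketched as follows. This assumes a plain ratio against the expert's maximum average return; the caption does not say whether a baseline (for example, a random policy's return) is subtracted first, so treat that as an open detail. The function name `normalized_score` and the return values are hypothetical.

```python
def normalized_score(avg_return, expert_max_avg_return):
    """Scale a return so the expert's max average return maps to 100."""
    if expert_max_avg_return == 0:
        raise ValueError("expert return must be non-zero")
    return 100.0 * avg_return / expert_max_avg_return


# Hypothetical returns, for illustration only (not results from the paper).
scores = [normalized_score(r, 200.0) for r in (0.0, 100.0, 200.0)]
print(scores)  # prints [0.0, 50.0, 100.0]
```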