Enhancing Online Reinforcement Learning with Meta-Learned Objective from Offline Data
Shilong Deng, Zetao Zheng, Hongcai He, Paul Weng, Jie Shao
TL;DR
Addresses sparse-reward RL by meta-learning an imitation-learning (IL) objective from offline demonstrations, a method called Generalized Imitation Learning from Demonstration (GILD). The approach is a bi-level optimization: at the upper level, a meta-learner $\omega$ shapes the IL objective that guides the RL policy $\phi$; at the lower level, the RL trainer balances $\mathcal{L}^{\mathrm{RL}}$ and $\mathcal{L}^{\mathrm{GILD}}$. Across four MuJoCo tasks and three off-policy algorithms, RL+GILD outperforms RL+IL and other baselines, reaching near-optimal performance with fast convergence and modest overhead thanks to a small warm-start. The work broadens the use of imitation information in RL and provides a general, algorithm-agnostic module for improving online learning from sub-optimal offline data.
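As a rough illustration of the bi-level structure described above, one training iteration might be written as follows. This is a schematic sketch, not the authors' exact formulation: the step sizes $\alpha$ and $\beta$, the unweighted sum of the two lower-level losses, and the outer objective $\mathcal{L}^{\mathrm{meta}}$ are assumptions introduced here for illustration.

$$
\begin{aligned}
\text{(lower level)}\quad & \phi'(\omega) = \phi - \alpha \, \nabla_{\phi}\big[\mathcal{L}^{\mathrm{RL}}(\phi) + \mathcal{L}^{\mathrm{GILD}}(\phi;\omega)\big] \\
\text{(upper level)}\quad & \omega \leftarrow \omega - \beta \, \nabla_{\omega}\, \mathcal{L}^{\mathrm{meta}}\big(\phi'(\omega)\big)
\end{aligned}
$$

The updated policy $\phi'$ depends on $\omega$ only through the lower-level step, so the upper-level gradient follows from the chain rule: $\nabla_{\omega}\mathcal{L}^{\mathrm{meta}} = \big(\partial \phi' / \partial \omega\big)^{\top} \nabla_{\phi'}\mathcal{L}^{\mathrm{meta}}$. In this sketch, $\mathcal{L}^{\mathrm{RL}}$ itself does not depend on $\omega$, so only the $\mathcal{L}^{\mathrm{GILD}}$ term contributes to $\partial \phi' / \partial \omega$.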
Abstract
A major challenge in Reinforcement Learning (RL) is the difficulty of learning an optimal policy from sparse rewards. Prior works enhance online RL with conventional Imitation Learning (IL) via a handcrafted auxiliary objective, at the cost of restricting the RL policy to be sub-optimal when the offline data is generated by a non-expert policy. Instead, to better leverage valuable information in offline data, we develop Generalized Imitation Learning from Demonstration (GILD), which meta-learns an objective that distills knowledge from offline data and instills intrinsic motivation towards the optimal policy. Unlike prior works that are tied to a specific RL algorithm, GILD is a flexible module intended for diverse vanilla off-policy RL algorithms. In addition, GILD introduces no domain-specific hyperparameters and only a minimal increase in computational cost. In four challenging MuJoCo tasks with sparse rewards, we show that three RL algorithms enhanced with GILD significantly outperform state-of-the-art methods.
