Data-Incremental Continual Offline Reinforcement Learning
Sibo Gai, Donglin Wang
TL;DR
Data-incremental continual offline RL (DICORL) tackles lifelong learning from a sequence of offline RL datasets for a single task, revealing active forgetting driven by conservative learning. The authors propose EREIQL, an experience-replay-based ensemble implicit Q-learning method that initializes multiple value networks at low values and uses experience replay to balance plasticity and stability. Its losses are $L_V=\mathbb{E}_{(s,a)\sim D}[L_2^\tau(Q(s,a)-\mathbb{E}_j V^j(s))]$ and $L_Q=\mathbb{E}_{(s,a,s')\sim D}[(r(s,a)+\gamma\min_j V^j(s')-Q(s,a))^2]$, where $L_2^\tau$ is the expectile loss and $\tau$ is set high (e.g., 0.99). The work formalizes DICORL, demonstrates that existing offline RL and continual-learning methods struggle in the data-incremental single-task setting, and shows on D4RL datasets that EREIQL achieves superior lifetime performance and retention. The findings highlight the need for ensemble value learning and large replay buffers to mitigate active forgetting in DICORL, offering a path toward practical lifelong offline RL with heterogeneous data sources. Limitations include high time and space costs, motivating future work on efficiency and scalability to more complex tasks.
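The two losses above can be made concrete with a small numerical sketch. It is an illustration only, not the authors' implementation: it assumes the standard expectile loss from implicit Q-learning, $L_2^\tau(u)=|\tau-\mathbb{1}(u<0)|\,u^2$, assumes a squared TD error for $L_Q$, and works on precomputed scalar values instead of neural networks. The function names and the list-of-lists layout for the ensemble (one inner list per value network) are hypothetical.

```python
def expectile_loss(u, tau):
    """Asymmetric squared loss: weight tau if u >= 0, else (1 - tau)."""
    weight = tau if u >= 0 else 1.0 - tau
    return weight * u * u


def value_loss(q_values, v_ensemble, tau=0.99):
    """L_V: expectile loss between Q(s, a) and the ensemble-mean V(s).

    q_values:   list of Q(s, a), one per sample in the batch.
    v_ensemble: list of lists; v_ensemble[j][i] is V^j(s_i).
    """
    n_members = len(v_ensemble)
    total = 0.0
    for i, q in enumerate(q_values):
        v_mean = sum(member[i] for member in v_ensemble) / n_members
        total += expectile_loss(q - v_mean, tau)
    return total / len(q_values)


def q_loss(q_values, rewards, v_next_ensemble, gamma=0.99):
    """L_Q: squared TD error (assumed) with target r + gamma * min_j V^j(s')."""
    total = 0.0
    for i, (q, r) in enumerate(zip(q_values, rewards)):
        v_min = min(member[i] for member in v_next_ensemble)
        total += (r + gamma * v_min - q) ** 2
    return total / len(q_values)
```

With a high $\tau$ such as 0.99, samples where $Q(s,a)$ exceeds the ensemble-mean value are weighted 99 times more heavily than samples where it falls below, which pushes $V$ toward the upper range of in-dataset action values. Taking the minimum over ensemble members in the TD target gives a pessimistic bootstrap.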
Abstract
In this work, we propose a new continual learning setting, data-incremental continual offline reinforcement learning (DICORL), in which an agent learns continually from a sequence of datasets for a single offline reinforcement learning (RL) task, rather than from a sequence of offline RL tasks, each with its own dataset. We argue that this setting introduces a unique challenge for continual learning: active forgetting, in which the agent actively forgets skills it has already learned. The main cause of active forgetting is the conservative learning that offline RL uses to counter value overestimation. Under conservative learning, an offline RL method indiscriminately suppresses the value of every action, previously learned or not, unless that action appears in the dataset currently being learned. As a result, lower-quality data can overwrite what was learned from higher-quality data, depending on the order in which the datasets arrive. To address this problem, we propose a new algorithm, experience-replay-based ensemble implicit Q-learning (EREIQL), which uses multiple value networks to lower the initial value estimates and thereby avoid conservative learning, and uses experience replay to mitigate catastrophic forgetting. Our experiments show that EREIQL alleviates active forgetting in DICORL and performs well.
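The experience-replay component can be sketched as follows. This is one common way to realize replay in a data-incremental setting, not a description of the paper's buffer policy: the names `mixed_batch` and `collect_batches`, the 50% replay fraction, and the unbounded buffer that keeps every past transition are all assumptions made for illustration.

```python
import random


def mixed_batch(current_data, replay_data, batch_size, replay_fraction=0.5, rng=None):
    """Draw a batch mixing the current dataset with replayed past transitions."""
    if rng is None:
        rng = random.Random(0)
    # Never request more replayed samples than the buffer holds.
    n_replay = min(int(batch_size * replay_fraction), len(replay_data))
    n_current = batch_size - n_replay
    batch = rng.choices(current_data, k=n_current)  # sampled with replacement
    if n_replay > 0:
        batch += rng.sample(replay_data, n_replay)  # sampled without replacement
    return batch


def collect_batches(datasets, batch_size=4, steps_per_dataset=2, seed=0):
    """Visit datasets in order; each finished dataset joins the replay buffer."""
    rng = random.Random(seed)
    replay = []
    batches = []
    for data in datasets:
        for _ in range(steps_per_dataset):
            batches.append(mixed_batch(data, replay, batch_size, rng=rng))
        replay.extend(data)  # assumed policy: keep every past transition
    return batches
```

For example, with two datasets of five transitions each and the defaults above, the batches drawn while learning the first dataset contain only its own transitions, while each batch drawn during the second dataset contains two replayed transitions from the first. In a real training loop, each batch would feed the value and Q updates instead of being collected in a list.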
