Data-Incremental Continual Offline Reinforcement Learning

Sibo Gai, Donglin Wang

TL;DR

Data-incremental continual offline RL (DICORL) tackles lifelong learning from a sequence of offline RL datasets for a single task, revealing active forgetting driven by conservative learning. The authors propose EREIQL, an experience-replay-based ensemble implicit Q-learning method that initializes multiple value networks at low values and leverages experience replay to balance plasticity and stability, with losses $L_V=\mathbb{E}_{(s,a)\sim D}[L_2^\tau(Q(s,a)-\mathbb{E}_j V^j(s))]$ and $L_Q=\mathbb{E}_{(s,a,s')\sim D}[r(s,a)+\gamma\min_j V^j(s')-Q(s,a)]$, and a high $\tau$ (e.g., 0.99). The work formalizes DICORL, demonstrates that existing offline RL and continual-learning methods struggle under data-incremental single-task settings, and shows on D4RL datasets that EREIQL achieves superior lifetime performance and retention. The findings highlight the necessity of ensemble value learning and large replay buffers to mitigate active forgetting in DICORL, offering a path toward practical lifelong offline RL with heterogeneous data sources. Limitations include high time and space costs, motivating future work on efficiency and scalability to more complex tasks.
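The two losses quoted above can be sketched in plain Python to make their shape concrete. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the expectile loss follows the usual implicit Q-learning convention $L_2^\tau(u)=|\tau-\mathbb{1}(u<0)|\,u^2$, and the TD residual in $L_Q$ is squared as in standard IQL, although the summary writes it without the square.

```python
def expectile_loss(u, tau):
    """Asymmetric squared loss L_2^tau(u) = |tau - 1(u < 0)| * u^2."""
    weight = tau if u >= 0 else 1.0 - tau
    return weight * u * u


def value_loss(q_values, v_ensemble, tau=0.99):
    """L_V: expectile loss on Q(s, a) minus the ensemble-mean V(s).

    q_values: list of Q(s, a), one per sample.
    v_ensemble: list of lists, v_ensemble[j][i] = V^j(s_i).
    """
    n_members = len(v_ensemble)
    total = 0.0
    for i, q in enumerate(q_values):
        v_mean = sum(member[i] for member in v_ensemble) / n_members
        total += expectile_loss(q - v_mean, tau)
    return total / len(q_values)


def q_loss(rewards, q_values, v_next_ensemble, gamma=0.99):
    """L_Q: squared TD residual against the ensemble-minimum V(s').

    Squaring the residual is an assumption borrowed from standard IQL.
    """
    total = 0.0
    for i, (r, q) in enumerate(zip(rewards, q_values)):
        v_min = min(member[i] for member in v_next_ensemble)
        total += (r + gamma * v_min - q) ** 2
    return total / len(q_values)
```

With $\tau$ close to 1, positive residuals (where $Q$ exceeds the mean value estimate) are weighted far more heavily than negative ones, so the value networks are pulled up toward the best in-dataset actions, while the minimum over ensemble members keeps the bootstrap target pessimistic.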

Abstract

In this work, we propose a new continual learning setting: data-incremental continual offline reinforcement learning (DICORL), in which an agent learns a sequence of datasets for a single offline reinforcement learning (RL) task, rather than a sequence of offline RL tasks each with its own dataset. We argue that this setting introduces a challenge unique to it: active forgetting, in which the agent actively forgets skills it has already learned. The main cause of active forgetting is the conservative learning that offline RL uses to address the overestimation problem. Under conservative learning, an offline RL method indiscriminately suppresses the value of every action, previously learned or not, unless that action appears in the dataset currently being learned. As a result, depending on the learning order, inferior data may overwrite superior data. To solve this problem, we propose a new algorithm, experience-replay-based ensemble implicit Q-learning (EREIQL), which uses multiple value networks to lower the initial value and thereby avoid conservative learning, and uses experience replay to relieve catastrophic forgetting. Our experiments show that EREIQL relieves active forgetting in DICORL and performs well.
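The mechanism behind active forgetting can be illustrated with a toy tabular example. This is a sketch under strong assumptions, not the paper's experiment: there is one state and two actions, `conservative_fit` is a hypothetical stand-in for conservative offline RL that regresses in-dataset actions toward their observed return and subtracts a fixed penalty from every action absent from the current dataset, and all hyperparameters are arbitrary.

```python
def conservative_fit(q, dataset, steps=200, lr=0.1, penalty=0.05):
    """Toy conservative update on a single state.

    q: list of action values; dataset: list of (action, return) pairs.
    The penalty on out-of-dataset actions is an assumed simplification
    of conservative value suppression.
    """
    q = list(q)
    in_data = {a for a, _ in dataset}
    for _ in range(steps):
        for a, r in dataset:
            q[a] += lr * (r - q[a])   # fit actions present in the dataset
        for a in range(len(q)):
            if a not in in_data:
                q[a] -= penalty       # suppress every other action
    return q


def greedy(q):
    """Index of the highest-valued action."""
    return max(range(len(q)), key=lambda a: q[a])


good = [(0, 1.0)]   # first dataset: action 0 with return 1.0
poor = [(1, 0.5)]   # second dataset: action 1 with return 0.5

# Sequential learning: the better action is suppressed while learning `poor`.
q_seq = conservative_fit(conservative_fit([0.0, 0.0], good), poor)

# Replaying the first dataset alongside the second keeps action 0 in-dataset.
q_replay = conservative_fit(conservative_fit([0.0, 0.0], good), poor + good)
```

In this toy setup `greedy(q_seq)` selects action 1, the inferior action learned last, while `greedy(q_replay)` still selects action 0. Both data points share the same state, so the forgetting here comes from value suppression alone rather than from any distribution shift.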

Paper Structure

This paper contains 19 sections, 7 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Diagram of DICORL. The algorithm learns a sequence of datasets for a single task one after another and is expected to perform best on the task as a whole, rather than on individual datasets.
  • Figure 2: Diagram of active forgetting. The network learns two datasets sequentially, containing the data points $\left(\mathbf{s}_1,\mathbf{a}_1\right)$ and $\left(\mathbf{s}_1,\mathbf{a}_2\right)$ respectively. Learning a worse action after a better action for the same state directly causes forgetting. This kind of forgetting does not depend on distribution shift.
  • Figure 3: Diagram of catastrophic forgetting. The network learns two datasets sequentially, containing the data points $\left(\mathbf{s}_1,\mathbf{a}_1\right)$ and $\left(\mathbf{s}_2,\mathbf{a}_2\right)$ respectively. Even though the two points have different states, learning the second one still changes the action selected in the first state because of distribution shift.
  • Figure 4: Diagram of EIQL. By using an ensemble of value networks, EIQL keeps the initialized value network below the Q network at every state, so that EIQL can use a very small $\tau$ to avoid active forgetting.
  • Figure 5: Performance on HalfCheetah across continual learning algorithms. The network learns each dataset for 500 epochs and then moves on to the next. The two dotted lines, from left to right, mark the switch from random to medium and from medium to random, respectively. Higher is better.
  • ...and 5 more figures