Continual Diffuser (CoD): Mastering Continual Offline Reinforcement Learning with Experience Rehearsal

Jifeng Hu, Li Shen, Sili Huang, Zhejian Yang, Hechang Chen, Lichao Sun, Yi Chang, Dacheng Tao

TL;DR

Continual Diffuser (CoD) tackles continual offline RL by combining diffusion-based trajectory modeling with experience rehearsal to address the plasticity-stability dilemma. It introduces a continual offline RL (CORL) benchmark of 90 tasks drawn from Continual World and Gym-MuJoCo to study sequential task learning, and trains a task-conditioned diffusion model that periodically replays rehearsal buffers to retain past knowledge. Empirical results show that CoD achieves strong forward transfer while mitigating forgetting across multiple task sequences, outperforming existing diffusion-based methods and other representative baselines on most tasks. The approach offers a practical framework for continual diffusion-based RL, along with benchmarks and analyses to support further research on continual offline learning for robotics and control.

Abstract

Artificial neural networks, especially recent diffusion-based models, have shown remarkable superiority in gaming, control, and QA systems, where the training tasks' datasets are usually static. However, in real-world applications, such as robotic control with reinforcement learning (RL), the tasks change, and new tasks arrive sequentially. This poses the challenge of a plasticity-stability trade-off when training an agent that can both adapt to task changes and retain acquired knowledge. In view of this, we propose a rehearsal-based continual diffusion model, called Continual Diffuser (CoD), to endow the diffuser with the capabilities of quick adaptation (plasticity) and lasting retention (stability). Specifically, we first construct an offline benchmark that contains 90 tasks from multiple domains. Then, we train CoD on each task with sequential modeling and conditional generation for making decisions. Next, we preserve a small portion of the previous datasets as the rehearsal buffer and replay it to retain the acquired knowledge. Extensive experiments on a series of tasks show that CoD achieves a promising plasticity-stability trade-off and outperforms existing diffusion-based methods and other representative baselines on most tasks.
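
The rehearsal idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `keep_fraction` and `rehearsal_ratio` parameters, and the uniform random sampling are all assumptions made for illustration, and the diffusion-model update is replaced by a placeholder comment. It only shows how a small portion of each finished task's dataset might be kept and mixed into later training batches.

```python
import random


def build_rehearsal_buffer(dataset, keep_fraction, rng):
    """Keep a small random subset of one task's samples for later replay."""
    n_keep = max(1, int(len(dataset) * keep_fraction))
    return rng.sample(dataset, n_keep)


def sample_batch(current, rehearsal, batch_size, rehearsal_ratio, rng):
    """Draw a batch that mixes current-task data with replayed earlier-task data."""
    replay_pool = [item for buf in rehearsal.values() for item in buf]
    # With no earlier tasks there is nothing to replay.
    n_replay = int(batch_size * rehearsal_ratio) if replay_pool else 0
    batch = [rng.choice(current) for _ in range(batch_size - n_replay)]
    batch += [rng.choice(replay_pool) for _ in range(n_replay)]
    return batch


def continual_training(task_datasets, keep_fraction=0.1, batch_size=8,
                       rehearsal_ratio=0.25, steps_per_task=3, seed=0):
    """Visit tasks in order, replaying stored samples from earlier tasks."""
    rng = random.Random(seed)
    rehearsal = {}  # task_id -> retained samples (hypothetical layout)
    log = []
    for task_id, dataset in enumerate(task_datasets):
        for _ in range(steps_per_task):
            batch = sample_batch(dataset, rehearsal, batch_size,
                                 rehearsal_ratio, rng)
            # Placeholder: a real system would take a gradient step of the
            # task-conditioned diffusion model on this batch.
            log.append((task_id, batch))
        # After finishing a task, keep only a small portion of its data.
        rehearsal[task_id] = build_rehearsal_buffer(dataset, keep_fraction, rng)
    return rehearsal, log


if __name__ == "__main__":
    # Toy data: each sample is a (task_id, index) pair standing in for a trajectory.
    tasks = [[(t, i) for i in range(20)] for t in range(3)]
    buffers, history = continual_training(tasks, keep_fraction=0.25)
    print({t: len(buf) for t, buf in buffers.items()})
```

In this sketch the first task trains on its own data only, while every later task sees a fixed share of replayed samples in each batch. The paper's own rehearsal schedule, including the rehearsal frequency and sample diversity analysed in its sensitivity study, may differ from this fixed-ratio mixing.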

Paper Structure (22 sections, 5 equations, 10 figures, 13 tables, 2 algorithms)

This paper contains 22 sections, 5 equations, 10 figures, 13 tables, 2 algorithms.

Figures (10)

  • Figure 1: The framework of CoD. Unfolding the training process over time, our model slides along a sample chain constructed by sampling from the current and rehearsal buffers. For each task $i$, CoD replays a small portion of samples from previous tasks to reduce catastrophic forgetting and to produce a solution that can solve all previous tasks. The detailed structure of CoD is shown in the lower right corner.
  • Figure 2: Comparison of CoD and other diffusion-based models under the continual offline RL setting, where "w/o" denotes "without", Multitask CoD is a multitask variant of CoD, CoD-LoRA uses low-rank adaptation during training, and CoD-RCR denotes CoD trained with return conditioning. IL-rehearsal denotes imitation learning with rehearsal. We train these methods on four arbitrarily selected tasks (tasks 10-15-19-25). The results show that previous diffusion-based methods ("DD-w/o rehearsal", "Diffuser-w/o rehearsal", and "MTDIFF") exhibit severe forgetting when the datasets arrive sequentially.
  • Figure 3: Comparison of CoD and other baselines on CW20, where each baseline is trained for 500k gradient steps per task on either online or offline data. Dash-dotted lines indicate task changes. Part (a) shows the comparison with baselines trained in online mode; part (b) shows the comparison with baselines trained on offline datasets.
  • Figure 4: Parameter sensitivity analysis of rehearsal frequency $\upsilon$ and rehearsal sample diversity $\xi$ on CW20.
  • Figure 5: Parameter sensitivity on Ant-dir.
  • ...and 5 more figures