Solving Continual Offline Reinforcement Learning with Decision Transformer

Kaixin Huang, Li Shen, Chen Zhao, Chun Yuan, Dacheng Tao

TL;DR

The paper addresses the stability–plasticity challenge in continual offline reinforcement learning (CORL) by using Decision Transformer (DT) as the backbone. It introduces MH-DT, which stores task-specific knowledge in separate heads and shared knowledge in common components, trained with distillation and selective rehearsal, and LoRA-DT, which adapts without a replay buffer by merging less influential weights and fine-tuning the MLP layer with LoRA. Empirical results on MuJoCo and Meta-World show that the DT-based methods surpass state-of-the-art CORL baselines in learning efficiency, memory efficiency, and forgetting resistance, with MH-DT delivering strong forward/backward transfer and LoRA-DT offering a compact, buffer-free alternative. Overall, the work suggests that DT, when equipped with targeted memory and adaptation mechanisms, can handle sequential offline control tasks while mitigating catastrophic forgetting.

Abstract

Continual offline reinforcement learning (CORL) combines continual and offline reinforcement learning, enabling agents to learn multiple tasks from static datasets without forgetting prior tasks. However, CORL faces challenges in balancing stability and plasticity. Existing methods, which employ Actor-Critic (AC) structures and experience replay (ER), suffer from distribution shifts, low efficiency, and weak knowledge sharing. We investigate whether Decision Transformer (DT), another offline RL paradigm, can serve as a more suitable offline continual learner to address these issues. We first compare AC-based offline algorithms with DT in the CORL framework. DT offers advantages in learning efficiency, distribution shift mitigation, and zero-shot generalization, but exacerbates the forgetting problem during supervised parameter updates. We introduce multi-head DT (MH-DT) and low-rank adaptation DT (LoRA-DT) to mitigate DT's forgetting problem. MH-DT stores task-specific knowledge using multiple heads and facilitates knowledge sharing through common components. It employs distillation and selective rehearsal to enhance current task learning when a replay buffer is available. In buffer-unavailable scenarios, LoRA-DT merges less influential weights and fine-tunes DT's decisive MLP layer to adapt to the current task. Extensive experiments on MuJoCo and Meta-World benchmarks demonstrate that our methods outperform SOTA CORL baselines and showcase enhanced learning capabilities and superior memory efficiency.
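
The low-rank adaptation idea behind LoRA-DT can be illustrated with a minimal sketch. It follows the generic LoRA formulation (a frozen weight plus a trainable product of two small matrices), not the authors' implementation. The class name `LoRALinear`, the helper functions, the initialization scale, and the zero initialization of `b` are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Add two matrices of the same shape element-wise."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


class LoRALinear:
    """Hypothetical linear layer: frozen weight w0 plus low-rank update a @ b."""

    def __init__(self, w0, rank, seed=0):
        rng = random.Random(seed)
        d_in, d_out = len(w0), len(w0[0])
        self.w0 = w0  # frozen weight, never updated during fine-tuning
        # Trainable factors. Assumption: a gets small random values and b
        # starts at zero, so the update a @ b is zero before training.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(rank)] for _ in range(d_in)]
        self.b = [[0.0] * d_out for _ in range(rank)]

    def forward(self, x):
        # Output = x @ w0 + (x @ a) @ b: the low-rank path is added
        # to the result of the original computation.
        return add(matmul(x, self.w0), matmul(matmul(x, self.a), self.b))

    def num_trainable(self):
        # Only the two low-rank factors are trained.
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])


w0 = [[1.0, 0.0, 0.0, 0.0],
      [0.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, 0.0],
      [0.0, 0.0, 0.0, 1.0]]
layer = LoRALinear(w0, rank=1)
x = [[1.0, 2.0, 3.0, 4.0]]
y = layer.forward(x)
```

With rank 1 and a 4×4 weight, the sketch trains 8 parameters instead of 16. The saving grows with layer width, which is in line with the memory efficiency the abstract attributes to the buffer-free variant.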

Paper Structure

This paper contains 15 sections, 11 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Performance on each task during continual learning. The target speed of the cheetah increases from $T_1$ to $T_6$, so the tasks become progressively more difficult. Each task is most similar to the tasks adjacent to it in the sequence.
  • Figure 2: Schematic diagram of MH-DT. The left part is the training process. We first learn a separate policy $\mu_n$, copy the parameters of its head part to head $h_n$, then calculate the loss in Eq. (\ref{eq:loss}) using the data in replay buffers $B_1, \dots, B_{n-1}$ and $D_n$, and update the corresponding head and shared parameters. The upper right part is the structure of each head. The front part includes embedding layers and a layer-norm layer, and the back part includes a linear layer for predicting actions. The lower right part is the schematic diagram of task selection through cosine similarity.
  • Figure 3: Model architecture of LoRA-DT. In each block of DT, we first fuse and freeze the weights of all layers except the MLP layer as in Eq. (\ref{eq:merge}), then use LoRA to fine-tune the MLP layer as in Eq. (\ref{eq:fine-tune}). The rightmost picture is a schematic diagram of LoRA. We fix the original parameter matrices $\bm{W}_0, \bm{W}_1$, use the product of the two matrices $\bm{A}\bm{B}$ to represent the update of the weight matrix, and add its output to the original calculation result.
  • Figure 4: Process of learning six sequential tasks in Ant_Dir, where our methods MH-DT and LoRA-DT are compared with five baselines and an upper bound PDT. Continual learning methods are trained for 30K steps on each task.
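
The task selection through cosine similarity mentioned in the Figure 2 caption can be sketched as follows. The caption does not say which vectors are compared, so this sketch assumes each task is summarized by a fixed-length feature vector. The function names, the task identifiers, and the example vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def most_similar_task(current, previous):
    """Return (task_id, similarity) for the stored task closest to `current`.

    `previous` maps a task identifier to its (assumed) feature vector.
    """
    if not previous:
        raise ValueError("no previous tasks to compare against")
    scored = ((task_id, cosine_similarity(current, vec))
              for task_id, vec in previous.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical feature vectors for two earlier tasks and the current one.
previous_tasks = {"T1": [1.0, 0.0], "T2": [0.0, 1.0]}
current_task = [0.9, 0.1]
best_id, best_score = most_similar_task(current_task, previous_tasks)
```

In this toy example the current task points mostly along the direction of `T1`, so `T1` is selected. How the paper constructs the compared vectors and uses the selected task is specified in the paper itself, not in this caption.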