Principled Fast and Meta Knowledge Learners for Continual Reinforcement Learning

Ke Sun, Hongming Zhang, Jun Jin, Chao Gao, Xi Chen, Wulong Liu, Linglong Kong

TL;DR

This study proposes a dual-learner framework comprising a fast learner and a meta learner to address continual Reinforcement Learning (RL) problems, and introduces an adaptive meta warm-up mechanism that selectively harnesses past knowledge.

Abstract

Inspired by the human learning and memory system, particularly the interplay between the hippocampus and cerebral cortex, this study proposes a dual-learner framework comprising a fast learner and a meta learner to address continual Reinforcement Learning (RL) problems. These two learners are coupled to perform distinct yet complementary roles: the fast learner focuses on knowledge transfer, while the meta learner ensures knowledge integration. In contrast to traditional multi-task RL approaches that share knowledge through average return maximization, our meta learner incrementally integrates new experiences by explicitly minimizing catastrophic forgetting, thereby supporting efficient cumulative knowledge transfer for the fast learner. To facilitate rapid adaptation in new environments, we introduce an adaptive meta warm-up mechanism that selectively harnesses past knowledge. Experiments on various pixel-based and continuous control benchmarks show that the proposed dual-learner approach outperforms baseline methods in continual learning. The code is available at https://github.com/datake/FAME.

Paper Structure

This paper contains 43 sections, 3 theorems, 30 equations, 12 figures, 12 tables, 2 algorithms.

Key Result

Proposition 1

Denote $\widetilde{\pi}_k^M(a|s) = \exp \left(\widetilde{Q}_k^M(a|s)/\tau\right) / \sum_{a^\prime} \exp \left( \widetilde{Q}_k^M(a^\prime|s)/\tau\right)$. After a softmax policy transformation, the Q-value-based meta learner incremental update is written as
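
As an illustration of the softmax policy transformation $\widetilde{\pi}_k^M$ defined above (only the transformation, not the incremental update), a minimal Python sketch follows. The function name `softmax_policy`, the example Q-values, and the temperature values are assumptions made for illustration and do not come from the paper or its released code.

```python
import math


def softmax_policy(q_values, tau=1.0):
    """Map the Q-values of one state to action probabilities.

    Computes pi(a|s) = exp(Q(a|s) / tau) / sum_a' exp(Q(a'|s) / tau).
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [q / tau for q in q_values]
    # Subtracting the maximum before exponentiating avoids overflow
    # and leaves the resulting probabilities unchanged.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical Q-values for three actions in a single state.
q = [1.0, 2.0, 3.0]
probs_sharp = softmax_policy(q, tau=0.1)   # low temperature: close to greedy
probs_flat = softmax_policy(q, tau=10.0)   # high temperature: close to uniform
```

A small temperature $\tau$ concentrates probability on the action with the largest Q-value, while a large $\tau$ pushes the policy toward uniform.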

Figures (12)

  • Figure 1: Illustration of FAME. In value-based continual RL, the fast learner can be denoted by $\{Q_k\}_{k=1}^K$ accordingly instead of $\{\pi_k\}_{k=1}^K$.
  • Figure 2: (Left) Average performance of the policy on each task, across 10 task sequences on MinAtar. Results are averaged over 3 seeds. The vertical lines at each point represent the standard errors. (Right) The selection ratio among three warm-up strategies when the arriving environment is previously encountered or novel.
  • Figure 3: (Left) Performance profile of the fast learner across tasks, where the y-axis shows the proportion of tasks that achieve a success rate greater than or equal to the x-axis value. (Right) Average performance, measured as the average success rate on past tasks across 10 seeds.
  • Figure 4: Learning curves of the fast learner in FAME on MinAtar Environments across 10 sequences of tasks.
  • Figure 5: Learning curves of the fast learner in FAME on the SpaceInvaders environment averaged over 3 seeds.
  • ...and 7 more figures

Theorems & Definitions (8)

  • Definition 1
  • Definition 2
  • Proposition 1: Incremental Softmax Q-Value-based Meta Learner Update
  • Proposition 2: Incremental Policy-based Meta Learner Update under Wasserstein Distance
  • Proposition 3: Incremental Q-Value-based Meta Learner Update
  • proof
  • proof
  • proof