Recasting Continual Learning as Sequence Modeling

Soochan Lee, Jaehyeon Son, Gunhee Kim

TL;DR

The paper reframes continual learning (CL) as sequence modeling within the meta-continual learning (MCL) framework: the forward pass of a sequence model carries out the CL process, and the model is meta-trained across many CL episodes. It demonstrates that decoder-only Transformers, including kernel-based efficient Transformers (KETs), can serve as general MCL methods, offering scalable, parallelizable training and a potentially more flexible update mechanism than traditional SGD-based CL. Across seven benchmarks spanning classification and regression, Transformer-based MCL shows strong performance, particularly in large-data regimes, while exposing efficiency-accuracy tradeoffs for KETs. The work suggests a pathway for bringing advances in sequence modeling into MCL and notes implications for scalability, biological plausibility, and future model development.

Abstract

In this work, we aim to establish a strong connection between two significant bodies of machine learning research: continual learning and sequence modeling. That is, we propose to formulate continual learning as a sequence modeling problem, allowing advanced sequence models to be utilized for continual learning. Under this formulation, the continual learning process becomes the forward pass of a sequence model. By adopting the meta-continual learning (MCL) framework, we can train the sequence model at the meta-level, on multiple continual learning episodes. As a specific example of our new formulation, we demonstrate the application of Transformers and their efficient variants as MCL methods. Our experiments on seven benchmarks, covering both classification and regression, show that sequence models can be an attractive solution for general MCL.
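
To make the phrase "the continual learning process becomes the forward pass of a sequence model" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single kernel-attention layer with identity projections and the `elu(x) + 1` feature map, and nothing in it is meta-trained; the names `LinearAttentionLearner`, `observe`, `predict`, and `run_episode` are hypothetical. The point is only that consuming the training stream is a recurrent state update with no gradient step.

```python
import math


def phi(x):
    # Positive feature map elu(x) + 1 (an assumption made for this sketch).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class LinearAttentionLearner:
    """Toy recurrent learner whose state is two running sums over the stream."""

    def __init__(self, key_dim, value_dim):
        self.S = [[0.0] * value_dim for _ in range(key_dim)]  # sum of phi(x) y^T
        self.z = [0.0] * key_dim                              # sum of phi(x)

    def observe(self, x, y):
        # "Learning" one example is a forward-pass state update, not an SGD step.
        f = phi(x)
        for i, fi in enumerate(f):
            self.z[i] += fi
            for j, yj in enumerate(y):
                self.S[i][j] += fi * yj

    def predict(self, x):
        # Kernel-weighted average of the targets seen so far.
        f = phi(x)
        norm = sum(fi * zi for fi, zi in zip(f, self.z))
        value_dim = len(self.S[0])
        if norm == 0.0:
            return [0.0] * value_dim  # nothing observed yet
        return [
            sum(f[i] * self.S[i][j] for i in range(len(f))) / norm
            for j in range(value_dim)
        ]


def run_episode(train_stream, test_inputs):
    # One CL episode: read the stream in order, then answer test queries.
    key_dim = len(train_stream[0][0])
    value_dim = len(train_stream[0][1])
    learner = LinearAttentionLearner(key_dim, value_dim)
    for x, y in train_stream:
        learner.observe(x, y)
    return [learner.predict(x) for x in test_inputs]


# Two tasks presented one after the other, with one-hot targets.
stream = [
    ([3.0, -3.0], [1.0, 0.0]), ([2.5, -3.5], [1.0, 0.0]),   # task 1
    ([-3.0, 3.0], [0.0, 1.0]), ([-3.5, 2.5], [0.0, 1.0]),   # task 2
]
preds = run_episode(stream, [[3.0, -3.0], [-3.0, 3.0]])
```

In this toy setting the first query is still assigned to the first task's class after the second task has been observed, because the state accumulates rather than overwrites. In the paper's framework, the learned parameters of the sequence model would be meta-trained across many such episodes; this sketch omits that step entirely.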

Paper Structure (26 sections, 1 equation, 5 figures, 6 tables, 3 algorithms)

Figures (5)

  • Figure 1: Schematic illustrations of the key concepts. (a) In MCL, multiple CL episodes are split into meta-training and meta-test sets. For each CL episode, a continual learner produces a model from the training stream (blue), which is evaluated on the test set (green). The learner is meta-trained on multiple CL episodes in the meta-training set and evaluated on the meta-test set. (b) In many MCL approaches, the learner mainly depends on SGD to update the model in the inner loop. (c) In our framework, a recurrent sequence model plays both the roles of the learner and the model.
  • Figure 2: Scaling behavior of Transformers.
  • Figure 3: Forgetting analysis of 100-task CASIA benchmark.
  • Figure 4: Forgetting analysis of 100-task Sine benchmark.
  • Figure : Inner loop of conventional SGD-based MCL
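
For contrast with the sequence-model view, the SGD-based inner loop mentioned in Figure 1(b) and in the last entry above can be sketched as follows. This is an illustrative assumption, not the paper's algorithm: the model is a scalar linear regressor, the loss is squared error, and `sgd_inner_loop`, `w0`, `b0`, and `lr` are hypothetical names. In SGD-based MCL methods, quantities such as the initialization would typically be the meta-learned part.

```python
def sgd_inner_loop(train_stream, w0, b0, lr):
    """Update a linear model y = w * x + b one example at a time, in stream order."""
    w, b = w0, b0
    for x, y in train_stream:
        err = (w * x + b) - y   # prediction error on the current example
        w -= lr * err * x       # gradient of 0.5 * err**2 with respect to w
        b -= lr * err           # gradient of 0.5 * err**2 with respect to b
    return w, b


# A stream drawn from y = 2x, replayed several times for the sketch.
stream = [(x, 2.0 * x) for x in (1.0, -1.0, 0.5, -0.5)] * 50
w, b = sgd_inner_loop(stream, w0=0.0, b0=0.0, lr=0.1)
```

Here every example triggers an explicit gradient step on the model parameters, whereas in the sequence-model formulation the same role is played by the model's internal state update during the forward pass.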