Learning Mamba as a Continual Learner: Meta-learning Selective State Space Models for Efficient Continual Learning

Chongyang Zhao, Dong Gong

TL;DR

Problem: performing continual learning efficiently from non-stationary streams without storing all past representations. Approach: meta-learn a continual learner built on a selective SSM (Mamba) and train it with selectivity regularization; the resulting method, MambaCL, performs online sequence prediction with a fixed-size hidden state. Key findings: MambaCL matches or surpasses Transformer-based baselines on diverse CL/MCL tasks while using fewer parameters and less computation, and it is robust to long sequences, domain shifts, and noisy inputs. Significance: a memory-efficient, generalizable approach to continual adaptation, suited to resource-constrained deployment and real-world non-stationary data.

Abstract

Continual learning (CL) aims to efficiently learn from a non-stationary data stream, without storing or recomputing all seen samples. CL enables prediction on new tasks by incorporating sequential training samples. Building on this connection between CL and sequential modeling, meta-continual learning (MCL) aims to meta-learn an efficient continual learner as a sequence prediction model, with advanced sequence models like Transformers being natural choices. However, despite decent performance, Transformers rely on a linearly growing cache to store all past representations, which conflicts with CL's objective of not storing all seen samples and limits efficiency. In this paper, we focus on meta-learning sequence-prediction-based continual learners without retaining all past representations. While attention-free models with fixed-size hidden states (e.g., Linear Transformers) align with CL's essential goal and efficiency needs, they have shown limited effectiveness in MCL in previous literature. Given Mamba's strong sequence modeling performance and attention-free nature, we explore a key question: Can attention-free models like Mamba perform well on MCL? By formulating Mamba and the SSM for MCL tasks, we propose MambaCL, a meta-learned continual learner. To enhance MambaCL's training, we introduce selectivity regularization, leveraging the connection between Mamba and Transformers to guide its behavior over sequences. Furthermore, we study how Mamba and other models perform across various MCL scenarios through extensive and well-designed experiments. Our results highlight the promising performance and strong generalization of Mamba and attention-free models in MCL, demonstrating their potential for efficient continual learning and adaptation.
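The memory contrast drawn in the abstract, a cache that grows with every sample versus a fixed-size hidden state, can be illustrated with a minimal sketch. This is not MambaCL or any model from the paper: `GrowingCacheLearner`, `FixedStateLearner`, `one_hot`, and `run_stream` are hypothetical names, the first class is a toy softmax-attention lookup, and the second is a toy unnormalized linear-attention-style recurrence chosen only because it keeps a constant-size state.

```python
import math


class GrowingCacheLearner:
    """Toy attention-style learner (illustrative only): keeps every (key, value) pair."""

    def __init__(self):
        self.keys = []
        self.values = []

    def observe(self, key, value):
        # The cache grows by one entry per training sample.
        self.keys.append(list(key))
        self.values.append(float(value))

    def predict(self, query):
        # Softmax attention over all stored keys.
        scores = [sum(q * k for q, k in zip(query, key)) for key in self.keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return sum(w * v for w, v in zip(weights, self.values)) / total

    def memory_size(self):
        # Number of stored scalars: all keys plus all values.
        return sum(len(k) for k in self.keys) + len(self.values)


class FixedStateLearner:
    """Toy recurrent learner (illustrative only): folds each sample into a fixed-size state."""

    def __init__(self, dim, decay=1.0):
        self.state = [0.0] * dim
        self.decay = decay  # hypothetical constant decay; Mamba uses input-dependent gating

    def observe(self, key, value):
        # state <- decay * state + value * key; the state size never changes.
        self.state = [self.decay * s + value * k for s, k in zip(self.state, key)]

    def predict(self, query):
        # Read out by projecting the state onto the query.
        return sum(s * q for s, q in zip(self.state, query))

    def memory_size(self):
        return len(self.state)


def one_hot(index, dim):
    return [1.0 if i == index else 0.0 for i in range(dim)]


def run_stream(learner, num_samples, dim):
    """Feed a synthetic stream of (key, value) samples and report memory use."""
    for t in range(num_samples):
        learner.observe(one_hot(t % dim, dim), float(t % dim))
    return learner.memory_size()
```

On a 100-sample stream with 4-dimensional keys, `run_stream` reports 500 stored numbers for the growing cache and 4 for the fixed state. The sketch says nothing about accuracy; it only shows why the fixed-state design matches CL's goal of not retaining all seen samples.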

Paper Structure

This paper contains 27 sections, 5 equations, 22 figures, 8 tables.

Figures (22)

  • Figure 1: The overall framework of our proposed methods. We meta-train a Mamba Learner $f_{\theta}()$ to perform meta-continual learning (MCL) by processing an online data stream containing paired $({\mathbf{x}}, y)$ examples. Meta-learning of this continual learner is conducted across multiple CL episodes. The model produces predictions by relying on the retained hidden state. Here, we demonstrate how the Mamba learner recurrently processes input data at steps $0$, $2$, and $t-1$, respectively.
  • Figure 2: The Mamba block in MambaCL.
  • Figure 3: Final-layer associations in a 20-task 5-shot meta-testing episode of MambaCL (meta-trained on 20-task 5-shot MCL). All three visualizations share the same training episode (shots $0^{th}\!\!-\!99^{th}$), with test queries at the $100^{th}$ shot from the $1^{st}$, $9^{th}$, and $18^{th}$ tasks, aligning with the $5^{th}\!\!-\!9^{th}$, $45^{th}\!\!-\!49^{th}$, and $90^{th}\!\!-\!94^{th}$ training shots. The red box highlights each query and its corresponding examples. See Appendix Sec. \ref{sec:vis_appendix} for additional visualizations comparing Transformer attention and Mamba's associative selectivity.
  • Figure 4: Task
  • Figure 5: Shot
  • ...and 17 more figures
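
The recurrent processing described in the Figure 1 caption can be sketched with the general selective SSM recurrence from the Mamba literature. This is the commonly cited form, not necessarily the paper's own notation: $u_t$ (the embedded token at step $t$, built from a paired $({\mathbf{x}}, y)$ example), $h_t$, $\Delta_t$, $A$, $B_t$, and $C_t$ are symbols assumed here for illustration.

$$
h_t = \bar{A}_t\, h_{t-1} + \bar{B}_t\, u_t, \qquad o_t = C_t\, h_t, \qquad \bar{A}_t = \exp(\Delta_t A), \qquad \bar{B}_t \approx \Delta_t B_t
$$

Here $\Delta_t$, $B_t$, and $C_t$ are computed from the current input $u_t$, which is what makes the model selective, and $h_t$ has a fixed size regardless of how many examples have been processed. Predictions for a query depend on past examples only through the retained state $h_{t-1}$.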