One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning

Minh Le, Bao-Ngoc Dao, Huy Nguyen, Quyen Tran, Anh Nguyen, Nhat Ho

TL;DR

The paper addresses catastrophic forgetting in continual learning by reconciling efficiency and performance in prompt-based methods. It introduces SMoPE, which restructures a single shared prefix prompt into multiple prompt experts within a sparse Mixture of Experts, enabling dynamic, input-driven activation to mitigate interference. Key contributions include a prompt-attention score aggregation mechanism, an adaptive noise scheme to balance expert utilization, and a prototype-based loss leveraging prefix keys as implicit memory. Empirical results on ImageNet-R, CIFAR-100, and CUB-200 show SMoPE delivering state-of-the-art or competitive performance with significantly reduced parameter counts and computation, highlighting its practical impact for scalable continual learning with ViT backbones.

Abstract

Prompt-based methods have recently gained prominence in Continual Learning (CL) due to their strong performance and memory efficiency. A prevalent strategy in this paradigm assigns a dedicated subset of prompts to each task, which, while effective, incurs substantial computational overhead and causes memory requirements to scale linearly with the number of tasks. Conversely, approaches employing a single shared prompt across tasks offer greater efficiency but often suffer from degraded performance due to knowledge interference. To reconcile this trade-off, we propose SMoPE, a novel framework that integrates the benefits of both task-specific and shared prompt strategies. Inspired by recent findings on the relationship between Prefix Tuning and Mixture of Experts (MoE), SMoPE organizes a shared prompt into multiple "prompt experts" within a sparse MoE architecture. For each input, only a select subset of relevant experts is activated, effectively mitigating interference. To facilitate expert selection, we introduce a prompt-attention score aggregation mechanism that computes a unified proxy score for each expert, enabling dynamic and sparse activation. Additionally, we propose an adaptive noise mechanism to encourage balanced expert utilization while preserving knowledge from prior tasks. To further enhance expert specialization, we design a prototype-based loss function that leverages prefix keys as implicit memory representations. Extensive experiments across multiple CL benchmarks demonstrate that SMoPE consistently outperforms task-specific prompt methods and achieves performance competitive with state-of-the-art approaches, all while significantly reducing parameter counts and computational costs.

Paper Structure

This paper contains 28 sections, 3 theorems, 91 equations, 12 figures, 7 tables, 1 algorithm.

Key Result

Theorem A.2

Under the strong identifiability condition for the expert function $h({\bm X},\eta)$, the least squares estimator $\widehat{G}_n$ converges to the true measure $G^*$ at the following rate:

Figures (12)

  • Figure 1: SMoPE Implementation in Attention Layers. The attention mechanism for each head is composed of both pre-trained and prompt components. The pre-trained attention matrix $A_l^\text{pre-trained}$ is computed using standard self-attention. To construct the prompt attention matrix $\tilde{A}_l^\text{prompt}$, we first calculate the average input representation $\tilde{{\bm x}}$, and evaluate the scores for all prompt experts. During training, frequently activated prompt experts are penalized by applying an adaptive noise to their scores, which promotes exploration of underutilized experts for new tasks while preserving essential knowledge in critical experts. A Top-$K$ selection operator then identifies the most relevant experts based on these adjusted scores. The selected scores are row-expanded to form $\tilde{A}_l^\text{prompt}$. Finally, $\tilde{A}_l^\text{prompt}$ is concatenated with $A_l^\text{pre-trained}$ to produce the final attention matrix, which is applied to the expert representations via a dot product, similar to the standard self-attention mechanism.
  • Figure 2: Activation Frequencies of Prompt Experts. Results on CUB-200 with the prompt length $N_p = 25$ and $K = 5$. We show one representative attention head and visualize the frequency with which prompt experts are activated after training on all tasks under different values of $\epsilon$.
  • Figure 3: Comparison of computational cost on ImageNet-R (10-task split), including learnable parameters (millions), training and inference costs (GFLOPs), and relative cost (%) to L2P.
  • Figure 3: Impact of Prompt Length $N_p$ and Number of Selected Experts $K$ on Performance. Performance across different combinations of $N_p$ and $K$ values on CUB-200 with a 10-task split.
  • Figure 4: Distribution of Prompt Expert Scores. Box plot illustrating the distribution of prompt expert scores $\tilde{s}_{j'}$ across attention heads in the first MSA block on the CUB-200 dataset.
  • ...and 7 more figures
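
The expert-selection steps described in the Figure 1 caption (average the input, score every prompt expert, penalize frequently used experts, keep the Top-$K$, row-expand the result) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the scaled dot-product score, the deterministic penalty `penalty_strength * activation_freq` (a stand-in for the paper's adaptive noise, whose exact form is not given here), and all function and parameter names are hypothetical.

```python
import math


def mean_token(tokens):
    """Average the token embeddings (rows) into a single vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def expert_scores(x_bar, prefix_keys):
    """Assumed scoring rule: scaled dot product between the averaged
    input and each prefix key (one key per prompt expert)."""
    scale = math.sqrt(len(x_bar))
    return [sum(a * b for a, b in zip(x_bar, key)) / scale
            for key in prefix_keys]


def penalize_frequent(scores, activation_freq, penalty_strength):
    """Hypothetical stand-in for the adaptive noise: lower the score of
    each expert in proportion to how often it has been activated."""
    return [s - penalty_strength * f
            for s, f in zip(scores, activation_freq)]


def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def select_experts(tokens, prefix_keys, k,
                   activation_freq=None, penalty_strength=0.0):
    """Return (indices, scores) of the k prompt experts chosen for one input."""
    x_bar = mean_token(tokens)
    scores = expert_scores(x_bar, prefix_keys)
    if activation_freq is not None:  # training-time penalty only
        scores = penalize_frequent(scores, activation_freq, penalty_strength)
    chosen = top_k(scores, k)
    return chosen, [scores[j] for j in chosen]


def row_expand(selected_scores, num_tokens):
    """Repeat the selected scores once per token so they can sit next to
    the per-token pre-trained attention scores."""
    return [list(selected_scores) for _ in range(num_tokens)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0]]                       # 2 tokens, d = 2
    keys = [[2.0, 0.0], [0.0, 4.0], [3.0, 3.0], [-1.0, -1.0]]  # 4 experts
    print(select_experts(tokens, keys, k=2))
    print(select_experts(tokens, keys, k=2,
                         activation_freq=[0.0, 0.0, 1.0, 0.0],
                         penalty_strength=2.0))
```

In the toy call, the third expert wins without a penalty but drops out of the Top-2 once its activation frequency is penalized, which is the exploration effect the caption describes. Because every token of an input shares the same averaged representation, selection happens once per input and the same scores are repeated for each row.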

Theorems & Definitions (4)

  • Definition A.1: Strong Identifiability
  • Theorem A.2
  • Theorem A.3
  • Lemma A.4