Mixture of Experts Meets Prompt-Based Continual Learning

Minh Le, An Nguyen, Huy Nguyen, Trang Nguyen, Trang Pham, Linh Van Ngo, Nhat Ho

TL;DR

This work analyzes prompt-based continual learning (CL) through the lens of mixture of experts (MoE), showing that the attention block of a Vision Transformer implicitly encodes an MoE with linear experts and quadratic gating score functions. It reinterprets prefix tuning as adding task-specific experts, identifies suboptimal sample efficiency in linear gating, and proposes NoRGa (Non-linear Residual Gates), which injects a non-linear residual term into the gating scores while preserving parameter efficiency. The authors provide theoretical results showing improved parameter-estimation rates under an algebraic-independence condition, and they argue that a linear-gating MoE can be data-hungry whereas NoRGa achieves polynomial data efficiency. Empirically, NoRGa attains state-of-the-art final and cumulative accuracies across multiple CL benchmarks and pretraining regimes, with robust performance across activation functions and reasonable training times.

Abstract

Exploiting the power of pre-trained models, prompt-based approaches stand out compared to other continual learning solutions in effectively preventing catastrophic forgetting, even with very few learnable parameters and without the need for a memory buffer. While existing prompt-based continual learning methods excel in leveraging prompts for state-of-the-art performance, they often lack a theoretical explanation for the effectiveness of prompting. This paper conducts a theoretical analysis to unravel how prompts bestow such advantages in continual learning, thus offering a new perspective on prompt design. We first show that the attention block of pre-trained models like Vision Transformers inherently encodes a special mixture of experts architecture, characterized by linear experts and quadratic gating score functions. This realization drives us to provide a novel view on prefix tuning, reframing it as the addition of new task-specific experts, thereby inspiring the design of a novel gating mechanism termed Non-linear Residual Gates (NoRGa). Through the incorporation of non-linear activation and residual connection, NoRGa enhances continual learning performance while preserving parameter efficiency. The effectiveness of NoRGa is substantiated both theoretically and empirically across diverse benchmarks and pretraining paradigms. Our code is publicly available at https://github.com/Minhchuyentoancbn/MoE_PromptCL
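
The claim that an attention block encodes a mixture of experts can be checked numerically on a toy example. The sketch below is not the authors' code: the helper names (`attention_row`, `moe_row`, `vecmat`, `mix`) and the toy matrices are assumptions made for illustration. It computes one output row of single-head attention in two ways: the usual query/key/value route, and an MoE route in which each token contributes a linear expert (its value vector) and the gate score is a quadratic (bilinear) form in the input tokens.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def vecmat(x, W):
    """Row vector x (length d) times matrix W (d x k)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def mix(gates, experts):
    """Gate-weighted sum of expert outputs."""
    width = len(experts[0])
    return [sum(g * e[c] for g, e in zip(gates, experts)) for c in range(width)]

def attention_row(X, i, Wq, Wk, Wv):
    """Output for token i of single-head scaled dot-product attention."""
    d_k = len(Wq[0])
    q = vecmat(X[i], Wq)
    keys = [vecmat(x, Wk) for x in X]
    values = [vecmat(x, Wv) for x in X]
    weights = softmax([dot(q, k) / math.sqrt(d_k) for k in keys])
    return mix(weights, values)

def moe_row(X, i, Wq, Wk, Wv):
    """Same output, written as an MoE over the input tokens."""
    d, d_k = len(Wq), len(Wq[0])
    # Linear experts: expert j maps the input to x_j @ Wv.
    experts = [vecmat(x, Wv) for x in X]
    # Quadratic gate scores: x_i^T (Wq Wk^T) x_j / sqrt(d_k).
    M = [[dot(Wq[a], Wk[b]) for b in range(d)] for a in range(d)]
    scores = [dot(vecmat(X[i], M), x_j) / math.sqrt(d_k) for x_j in X]
    return mix(softmax(scores), experts)

# Toy inputs (illustrative values only): 3 tokens of dimension 3, head size 2.
X = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [0.0, 1.0, 1.0]]
Wq = [[0.2, -0.1], [0.0, 0.3], [0.5, 0.1]]
Wk = [[0.1, 0.4], [-0.2, 0.0], [0.3, 0.2]]
Wv = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]

attn_out = attention_row(X, 0, Wq, Wk, Wv)
moe_out = moe_row(X, 0, Wq, Wk, Wv)
```

The two outputs agree up to floating-point error, which is the sense in which the softmax attention weights act as gates over experts stored in the value matrix.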

Paper Structure

This paper contains 23 sections, 5 theorems, 99 equations, 3 figures, 8 tables, and 1 algorithm.

Key Result

Theorem 4.1

Equipped with the least squares estimator $\widehat{G}_n$ defined in equation (eq:least_squared_estimator), the model estimate $g_{\widehat{G}_n}(\cdot)$ converges to the true model $g_{G_*}(\cdot)$ at the following rate:

Figures (3)

  • Figure 1: An illustrative depiction of the relationship between self-attention and MoE. Each output vector of a head in the MSA layer can be viewed as the output of a MoE model. These MoE models share the same set of experts encoded in the value matrix. Each entry in the attention matrix corresponds to a score function within this architecture.
  • Figure 2: Left: An illustrative depiction of prefix tuning as the introduction of new experts into pre-trained MoE models. Right: Visualization of NoRGa implementation, integrating non-linear activation and residual connections into the prefix tuning attention matrix.
  • Figure 3: Validation loss on Split CUB-200 throughout the training of the first task.
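
The two panels of Figure 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the function names (`norga_score`, `prefix_attention_row`), the scalar parameters `alpha` and `tau`, and the specific form `s + alpha * act(tau * s)` are one way to realize "a non-linear activation plus a residual connection" on the prefix scores, and the toy vectors are invented for the example. Prefix value vectors play the role of the new experts, and prefix key vectors determine their gate scores.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def norga_score(s, alpha, tau, act=math.tanh):
    """Assumed gate form: residual term s plus a scaled non-linear term."""
    return s + alpha * act(tau * s)

def prefix_attention_row(q, keys, values, prefix_keys, prefix_values,
                         alpha=0.0, tau=1.0, act=math.tanh):
    """One attention output row with prefix experts prepended.

    With alpha=0 this reduces to plain prefix tuning; with alpha != 0 only
    the scores of the new (prefix) experts are modified.
    """
    scale = math.sqrt(len(q))
    pretrained_scores = [dot(q, k) / scale for k in keys]
    prefix_scores = [norga_score(dot(q, pk) / scale, alpha, tau, act)
                     for pk in prefix_keys]
    gates = softmax(prefix_scores + pretrained_scores)
    experts = prefix_values + values
    width = len(experts[0])
    out = [sum(g * e[c] for g, e in zip(gates, experts)) for c in range(width)]
    return out, gates

# Toy example (illustrative values only): two pre-trained experts, one prefix expert.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
prefix_keys = [[2.0, 0.0]]
prefix_values = [[5.0, 5.0]]

plain_out, plain_gates = prefix_attention_row(q, keys, values, prefix_keys, prefix_values)
norga_out, norga_gates = prefix_attention_row(q, keys, values, prefix_keys, prefix_values,
                                              alpha=1.0, tau=1.0)
```

In this sketch the gate adds only two scalars on top of prefix tuning, which is consistent with the parameter-efficiency claim. Whether the actual method shares `alpha` and `tau` across heads or layers is not stated in this summary.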

Theorems & Definitions (10)

  • Definition 2.1: Scaled Dot-Product Attention
  • Definition 2.2: Multi-head Self-Attention Layer
  • Theorem 4.1: Regression Estimation Rate
  • Definition 4.2: Algebraic Independence
  • Theorem 4.3
  • Theorem A.1
  • Proof of Theorem \ref{prop:linear_gate_polynomial_expert}
  • Lemma A.2
  • Proof of Lemma \ref{lemma:minimax_lower_bound}
  • Lemma B.1