PMoE: Progressive Mixture of Experts with Asymmetric Transformer for Continual Learning

Min Jae Jung, JooHee Kim

TL;DR

PMoE addresses catastrophic forgetting in continual learning of large language models with an asymmetric transformer: shallow layers retain general knowledge, while deep layers host experts that are added progressively as new tasks arrive. A router placed between the shallow and deep blocks directs representations to the growing set of experts, and LoRA-style adapters serve as the expert modules. On TRACE and general LLM benchmarks, PMoE outperforms replay-based, regularization-based, and earlier architecture-based methods, often with far fewer trainable parameters than full fine-tuning. The results point to practical applicability for continual learning in large-scale, task-agnostic settings and motivate further exploration of asymmetric MoE designs and routing strategies.

Abstract

Large Language Models (LLMs) encounter significant challenges in continual learning due to catastrophic forgetting, where new information overwrites previously acquired knowledge. This limitation leads to substantial environmental and economic waste. In this study, we introduce the PMoE, Progressive Mixture of Experts with Asymmetric Transformer, which aims to minimize forgetting by utilizing an asymmetric design with shallow layers dedicated to general knowledge and deep layers for new knowledge. PMoE incorporates progressively added experts in deep layers and a router that allocates new knowledge to the appropriate experts efficiently. The router, positioned adjacent to the deep layers, utilizes deep features aggregating consolidated information. This enables the router to perform efficiently, allocating new knowledge to the appropriate experts, which progressively increase in the deep layers. Extensive experiments on TRACE datasets and general language understanding datasets demonstrate that the proposed PMoE outperforms previous state-of-the-art approaches.
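
The progressive expert mechanism described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The class name `ProgressiveDeepLayer`, the helper functions, the dimensions, and the choice to weight each expert's low-rank update by a softmax over router logits are all assumptions made for illustration. The zero initialization of the up-projection follows the usual LoRA convention, so a newly added expert does not change the layer output until it is trained.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ProgressiveDeepLayer:
    """Hypothetical deep layer: frozen base weight plus a growing list of LoRA-style experts."""

    def __init__(self, dim, rank, rng):
        self.dim, self.rank, self.rng = dim, rank, rng
        # Stand-in for a frozen pretrained weight (random here, for illustration only).
        self.base = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]
        self.experts = []  # each expert is a (down, up) pair of low-rank matrices
        self.router = []   # one row of router weights per expert
        self.add_expert()  # start with a single expert

    def add_expert(self):
        """Append one expert and one router row, e.g. when a new task arrives."""
        down = [[self.rng.gauss(0, 0.1) for _ in range(self.dim)]
                for _ in range(self.rank)]
        up = [[0.0] * self.rank for _ in range(self.dim)]  # zero init: no output change yet
        self.experts.append((down, up))
        self.router.append([self.rng.gauss(0, 0.1) for _ in range(self.dim)])

    def forward(self, shallow_features, hidden):
        """Route on the shallow-stack output, then add the weighted expert updates."""
        probs = softmax(matvec(self.router, shallow_features))
        out = matvec(self.base, hidden)
        for p, (down, up) in zip(probs, self.experts):
            delta = matvec(up, matvec(down, hidden))  # low-rank update of this expert
            out = [o + p * d for o, d in zip(out, delta)]
        return out, probs


rng = random.Random(0)
layer = ProgressiveDeepLayer(dim=4, rank=2, rng=rng)
x = [0.5, -1.0, 0.25, 2.0]
before, _ = layer.forward(x, x)
layer.add_expert()  # simulate the arrival of a new task
after, probs = layer.forward(x, x)
```

In this sketch the router grows by one row each time an expert is added, and the layer output is unchanged immediately after expansion because every up-projection is still zero. Whether earlier experts are frozen during later tasks, and how the router is trained, is not specified here and would have to be taken from the paper itself.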

Paper Structure

This paper contains 21 sections, 7 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The overall architecture of PMoE for continual learning. Experts are located in each transformer block for fine-tuning, and the router is situated between the deep and shallow layers. Each shallow block contains only one expert, whereas deep blocks progressively add multiple experts as PMoE encounters new tasks.
  • Figure 2: Performance and computation as a function of the shallow threshold $\tau$. The best performance is at $\tau=24$, and computation decreases in proportion to $\tau$.
  • Figure 3: Probability matrix showing how the router allocates text from specific subsets to experts at (top) $\tau=6$ and (bottom) $\tau=24$.
  • Figure 4: Probability matrix in which a router with Equation \ref{eq:auxloss} allocates text from specific subsets to experts.
  • Figure 5: Examples in which each token is colored according to its largest choice probability in the router at (left) $\tau=6$ and (right) $\tau=24$.