PMoE: Progressive Mixture of Experts with Asymmetric Transformer for Continual Learning
Min Jae Jung, JooHee Kim
TL;DR
PMoE addresses catastrophic forgetting in continual learning of large language models by introducing an asymmetric transformer in which shallow layers retain general knowledge and deep layers host progressively added experts. A router placed between the shallow and deep blocks directs representations to the expanding set of experts, so new knowledge can be integrated with reduced forgetting. LoRA-style adapters serve as the expert modules. Empirical results on TRACE and general LLM benchmarks show PMoE outperforming replay-based, regularization-based, and previous architecture-based methods, often with far fewer trainable parameters than full fine-tuning. The results indicate that the approach is parameter-efficient and generalizes well, which suggests practical applicability to continual learning in large-scale, task-agnostic settings and motivates further exploration of asymmetric MoE designs and routing strategies.
Abstract
Large Language Models (LLMs) encounter significant challenges in continual learning due to catastrophic forgetting, where new information overwrites previously acquired knowledge. This limitation leads to substantial environmental and economic waste. In this study, we introduce PMoE, Progressive Mixture of Experts with Asymmetric Transformer, which aims to minimize forgetting through an asymmetric design: shallow layers are dedicated to general knowledge and deep layers to new knowledge. PMoE adds experts progressively in the deep layers and uses a router to allocate new knowledge to the appropriate experts. Because the router is positioned adjacent to the deep layers, it operates on deep features that aggregate consolidated information, which lets it allocate new knowledge efficiently as the number of experts in the deep layers grows. Extensive experiments on TRACE datasets and general language understanding datasets demonstrate that the proposed PMoE outperforms previous state-of-the-art approaches.
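
The routing idea described above can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal pure-Python toy, assuming a softmax router over all experts, LoRA-style experts of the form B(Ax) with B initialised to zero, an identity matrix standing in for the frozen pretrained weight, and freezing of earlier experts whenever a new one is added. The names `LoRAExpert`, `ProgressiveDeepLayer`, and `add_expert` are hypothetical and chosen only for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class LoRAExpert:
    """Low-rank update delta(x) = B(Ax). B starts at zero, so a new expert is a no-op."""

    def __init__(self, dim, rank, rng):
        self.A = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(dim)]
        self.trainable = True

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


class ProgressiveDeepLayer:
    """Toy deep layer: frozen base weight plus a growing, routed set of experts."""

    def __init__(self, dim, rank=2, seed=0):
        self.dim, self.rank = dim, rank
        self.rng = random.Random(seed)
        # Assumption: identity matrix as a stand-in for a frozen pretrained weight.
        self.W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.experts = []
        self.router = []  # one row of router weights per expert

    def add_expert(self):
        """Add one expert for a new task; earlier experts are frozen (assumption)."""
        for expert in self.experts:
            expert.trainable = False
        self.experts.append(LoRAExpert(self.dim, self.rank, self.rng))
        self.router.append([self.rng.gauss(0, 0.02) for _ in range(self.dim)])

    def gates(self, h):
        """Router scores computed from the features entering the deep layer."""
        return softmax(matvec(self.router, h))

    def forward(self, h):
        out = matvec(self.W, h)
        if not self.experts:
            return out
        for g, expert in zip(self.gates(h), self.experts):
            delta = expert(h)
            out = [o + g * d for o, d in zip(out, delta)]
        return out


layer = ProgressiveDeepLayer(dim=4)
h = [1.0, 2.0, 3.0, 4.0]   # features coming out of the shallow layers
layer.add_expert()          # first new task
layer.add_expert()          # second new task
y = layer.forward(h)        # routed mixture of expert updates on top of the base output
```

In this sketch, adding an expert leaves the layer's output unchanged until that expert's B matrix is trained, which is one common way to grow capacity without disturbing existing behaviour. Whether PMoE uses dense softmax gating or a sparse top-k variant is not stated in the abstract.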
