Theory on Mixture-of-Experts in Continual Learning
Hongbo Li, Sen Lin, Lingjie Duan, Yingbin Liang, Ness B. Shroff
TL;DR
This work provides the first theoretical characterization of sparsely gated Mixture-of-Experts (MoE) in continual learning, modeling tasks as overparameterized linear regression problems that arrive sequentially. It shows that, with a carefully designed multi-objective gating loss and an early-termination rule for gating updates, MoE can diversify its experts to specialize in different tasks and balance loads, while the router reliably assigns each task to the appropriate expert. The authors derive explicit expressions for forgetting and generalization error, demonstrating improvements over a single expert and clarifying how the number of experts affects convergence speed and performance. Experiments on synthetic linear data and DNNs corroborate the theory and suggest practical algorithmic guidelines for MoE in continual learning.
Abstract
Continual learning (CL) has garnered significant attention because of its ability to adapt to new tasks that arrive over time. Catastrophic forgetting of old tasks as the model adapts to new ones has been identified as a major issue in CL. The Mixture-of-Experts (MoE) model has recently been shown to effectively mitigate catastrophic forgetting in CL by employing a gating network to sparsify and distribute diverse tasks among multiple experts. However, there is a lack of theoretical analysis of MoE and its impact on learning performance in CL. This paper provides the first theoretical results to characterize the impact of MoE in CL through the lens of overparameterized linear regression tasks. We establish the benefit of MoE over a single expert by proving that the MoE model can diversify its experts to specialize in different tasks, while its router learns to select the right expert for each task and to balance the loads across all experts. Our study further suggests that, to attain system convergence, MoE in CL needs to terminate the update of the gating network after sufficient training rounds; this termination is not needed in existing MoE studies, which do not consider continual task arrival. Furthermore, we provide explicit expressions for the expected forgetting and overall generalization error to characterize the benefit of MoE for learning performance in CL. Interestingly, adding more experts requires additional rounds before convergence and may not enhance the learning performance. Finally, we conduct experiments on both synthetic and real datasets to extend these insights from linear models to deep neural networks (DNNs); these experiments also shed light on practical algorithm design for MoE in CL.
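
To make the setting above concrete, the following is a minimal sketch of top-1 routing over linear experts on sequentially arriving overparameterized regression tasks. It is an illustration under stated assumptions, not the authors' algorithm: the linear gate on summed features, the minimum-distance expert update, and all names (`route_top1`, `min_norm_update`, `run_moe`) are hypothetical choices made here. In particular, the gate is kept frozen from the start, which is a simplification: the multi-objective gating loss and the round at which gating updates terminate are not specified in this abstract, so they are not reproduced.

```python
import numpy as np


def min_norm_update(w, X, y):
    """Move w to the closest point that fits (X, y) exactly.

    X has shape (s, d) with s < d (overparameterized), so the system
    X w = y has many solutions; the pseudoinverse picks the one nearest w.
    """
    return w + np.linalg.pinv(X) @ (y - X @ w)


def route_top1(theta, X):
    """Return the index of the expert with the largest linear gate score.

    theta has shape (M, d): one gate vector per expert (assumed form).
    The task is summarized by the sum of its feature vectors.
    """
    scores = theta @ X.sum(axis=0)
    return int(np.argmax(scores))


def run_moe(tasks, num_experts, dim, seed=0):
    """Process tasks one by one; only the routed expert is updated."""
    rng = np.random.default_rng(seed)
    # Assumption: small random gate parameters that stay frozen throughout.
    theta = 0.01 * rng.standard_normal((num_experts, dim))
    experts = np.zeros((num_experts, dim))
    choices = []
    for X, y in tasks:
        m = route_top1(theta, X)
        experts[m] = min_norm_update(experts[m], X, y)
        choices.append(m)
    return experts, choices


# Small demo: 6 sequential tasks drawn from 2 hypothetical ground-truth models.
demo_rng = np.random.default_rng(42)
dim, samples = 20, 5
truths = demo_rng.standard_normal((2, dim))
demo_tasks = []
for t in range(6):
    X = demo_rng.standard_normal((samples, dim))
    demo_tasks.append((X, X @ truths[t % 2]))

demo_experts, demo_choices = run_moe(demo_tasks, num_experts=3, dim=dim)
```

Because each update only guarantees a fit to the most recent task routed to an expert, an expert that receives tasks from different ground-truth models can lose accuracy on earlier ones. This is the forgetting effect that the abstract says a well-trained router mitigates by sending similar tasks to the same expert.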
