Theory on Mixture-of-Experts in Continual Learning

Hongbo Li; Sen Lin; Lingjie Duan; Yingbin Liang; Ness B. Shroff

TL;DR

This work provides the first theoretical characterization of sparsely gated Mixture-of-Experts in continual learning, modeling tasks with overparameterized linear regression and sequential arrivals. It shows that with a carefully designed multi-objective gating loss and an early-termination rule, MoE can diversify experts to specialize across tasks and balance loads, while a router reliably assigns each task to the appropriate expert. The authors derive explicit forgetting and generalization error expressions, demonstrating improvements over a single expert and clarifying how the number of experts affects convergence speed and performance. Experiments on synthetic linear data and DNNs corroborate the theory and suggest practical algorithmic guidelines for MoE in continual learning.
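For orientation, the forgetting and generalization error named above are commonly defined in the CL literature as follows. This is a sketch under assumed notation that does not appear on this page: $\mathcal{E}_i(\cdot)$ is the test error on the task that arrived in round $i$, and $\bm{w}_t$ denotes the parameters serving that task after round $t$ (for MoE, the parameters of the expert the router assigned to it).

$$F_t = \frac{1}{t-1}\sum_{i=1}^{t-1}\big(\mathcal{E}_i(\bm{w}_t) - \mathcal{E}_i(\bm{w}_i)\big), \qquad G_T = \frac{1}{T}\sum_{i=1}^{T}\mathcal{E}_i(\bm{w}_T).$$

Here $F_t$ (for $t \ge 2$) measures how much the error on earlier tasks has grown since each was learned, and $G_T$ is the average error over all $T$ tasks after the final round.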

Abstract

Continual learning (CL) has garnered significant attention because of its ability to adapt to new tasks that arrive over time. Catastrophic forgetting (of old tasks) has been identified as a major issue in CL, as the model adapts to new tasks. The Mixture-of-Experts (MoE) model has recently been shown to effectively mitigate catastrophic forgetting in CL, by employing a gating network to sparsify and distribute diverse tasks among multiple experts. However, there is a lack of theoretical analysis of MoE and its impact on the learning performance in CL. This paper provides the first theoretical results to characterize the impact of MoE in CL via the lens of overparameterized linear regression tasks. We establish the benefit of MoE over a single expert by proving that the MoE model can diversify its experts to specialize in different tasks, while its router learns to select the right expert for each task and balance the loads across all experts. Our study further suggests an intriguing fact that the MoE in CL needs to terminate the update of the gating network after sufficient training rounds to attain system convergence, which is not needed in the existing MoE studies that do not consider the continual task arrival. Furthermore, we provide explicit expressions for the expected forgetting and overall generalization error to characterize the benefit of MoE in the learning performance in CL. Interestingly, adding more experts requires additional rounds before convergence, which may not enhance the learning performance. Finally, we conduct experiments on both synthetic and real datasets to extend these insights from linear models to deep neural networks (DNNs), which also shed light on the practical algorithm design for MoE in CL.
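The setting described above can be illustrated with a small simulation. The following is a minimal sketch, not the paper's algorithm: it assumes each task is a noiseless overparameterized linear regression, each expert is updated to the closest interpolating solution, and an oracle router (a stand-in for the learned gating network) sends every task to a fixed expert. All function names and parameter values are hypothetical.

```python
import numpy as np

def min_norm_update(w, X, y):
    """Move w to the closest point satisfying X^T w = y.

    X has shape (d, s) with s < d, so infinitely many w interpolate the data.
    """
    residual = y - X.T @ w
    return w + X @ np.linalg.solve(X.T @ X, residual)

def final_error(num_experts, T=150, d=50, s=10, N=3, seed=0):
    """Average squared distance to the ground truth after T sequential tasks."""
    rng = np.random.default_rng(seed)
    truths = rng.normal(size=(N, d))        # one ground-truth vector per task type
    experts = np.zeros((num_experts, d))    # all experts start at zero
    task_ids = rng.integers(0, N, size=T)   # task type arriving in each round
    for n in task_ids:
        X = rng.normal(size=(d, s))         # fresh features, fewer samples than dimensions
        y = X.T @ truths[n]                 # noiseless labels
        m = n % num_experts                 # ASSUMPTION: oracle routing, not a learned gate
        experts[m] = min_norm_update(experts[m], X, y)
    # Evaluate every round's task on the expert that the oracle router assigns to it.
    errors = [np.sum((experts[n % num_experts] - truths[n]) ** 2) for n in task_ids]
    return float(np.mean(errors))

if __name__ == "__main__":
    print("single expert:", final_error(num_experts=1))
    print("three experts:", final_error(num_experts=3))
```

With one expert, every update pulls the shared parameters toward the current task and away from the others, so the error on earlier tasks stays large. With one expert per task type, each expert only ever sees consistent data and keeps approaching its own ground truth. The sketch leaves out the parts the paper actually analyzes: how the router learns this assignment, and when its updates should stop.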

Paper Structure (42 sections, 23 theorems, 121 equations, 10 figures, 2 tables, 1 algorithm)

Introduction
Related work
Problem setting and MoE model design
CL in linear models
Structure of the MoE model
Training of the MoE model with key designs
Theoretical results on MoE training for CL
Theoretical results on forgetting and generalization
Case I: More experts than tasks
Case II: Fewer experts than tasks
Experiments
Conclusion
Experimental details and additional experiments
Experiments compute resources
Experimental details of \ref{fig:S_error}
...and 27 more sections

Key Result

Lemma 1

For any two feature matrices $\mathbf{X}$ and $\Tilde{\mathbf{X}}$ with the same feature signal $\bm{v}_n$, with probability at least $1-o(1)$, their corresponding gate outputs of the same expert $m$ satisfy

Figures (10)

Figure 1: An illustration of the MoE model.
Figure 2: The dynamics of forgetting and overall generalization errors with and without termination of updating $\mathbf{\Theta}_t$ in \ref{algo:update_MoE}. Here we set $N=6$ with $K=3$ clusters and vary $M\in\{1,5,10,20\}$.
Figure 3: The dynamics of overall generalization error and test accuracy under the CIFAR-10 dataset (krizhevsky2009learning). Here we set $K=4, N=300$ and $M\in\{1,4,12\}$.
Figure 4: The dynamics of forgetting under the CIFAR-10 dataset. Here we set $N=4$ and $M\in\{1, 4\}$.
Figure 5: Learning performance under the MNIST dataset (lecun1989handwritten). Here we set $K=3, N=60$ and $M\in\{1,4,7\}$.
...and 5 more figures

Theorems & Definitions (45)

Definition 1
Lemma 1: $M>N$ version
Proposition 1: $M>N$ version
Proposition 2: $M>N$ version
Proposition 3: $M>N$ version
Proposition 4
Theorem 1
Theorem 2
Lemma 2
proof
...and 35 more
