Dynamic Mixture of Experts Against Severe Distribution Shifts
Donghu Kim
TL;DR
The paper tackles continual learning under evolving data distributions, aiming to maintain plasticity without incurring catastrophic forgetting. It introduces DynamicMoE, which grows capacity by periodically adding bottlenecked experts to MoE layers, requires no explicit task indices, and keeps a fixed input/output footprint. Empirical results in synthetic continual learning and open-world RL show that DynamicMoE preserves early performance and can surpass full-capacity baselines because it allocates new capacity to shifting distributions more effectively; the router dynamics reveal distribution-specific expert specialization. The work demonstrates a parameter-efficient path to lifelong learning via dynamic MoE and provides actionable insights into how router-based specialization supports continual adaptation.
Abstract
The challenge of building neural networks that can continuously learn and adapt to evolving data streams is central to the fields of continual learning (CL) and reinforcement learning (RL). This lifelong learning problem is often framed in terms of the plasticity-stability dilemma, focusing on issues like loss of plasticity and catastrophic forgetting. Unlike artificial neural networks, biological brains maintain plasticity through capacity growth, which has inspired researchers to explore similar approaches in artificial networks, such as adding capacity dynamically. Prior solutions often lack parameter efficiency or depend on explicit task indices, whereas Mixture-of-Experts (MoE) architectures offer a promising alternative by specializing experts for distinct distributions. This paper evaluates a DynamicMoE approach in continual and reinforcement learning environments and benchmarks its effectiveness against existing network expansion methods.
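To make the growth mechanism concrete, the following is a minimal, illustrative sketch of an MoE layer whose expert count can grow while its input and output width stay fixed. It is not the paper's implementation: the class name `DynamicMoELayer`, the method `add_expert`, the ReLU bottleneck, the dense softmax routing, and the initialization scale are all assumptions chosen for brevity, and training and the expansion schedule are omitted.

```python
import math
import random


def _matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def _rand_matrix(rows, cols, rng, scale=0.1):
    """Small Gaussian-initialized matrix (the scale is an arbitrary assumption)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class DynamicMoELayer:
    """Hypothetical toy MoE layer: experts can be added, width stays d_model."""

    def __init__(self, d_model, d_bottleneck, n_experts=1, seed=0):
        self.d_model = d_model
        self.d_bottleneck = d_bottleneck
        self.rng = random.Random(seed)
        self.experts = []  # each expert is a pair (W_down, W_up)
        self.router = []   # one row of router weights per expert
        for _ in range(n_experts):
            self.add_expert()

    def add_expert(self):
        """Append one bottlenecked expert and a matching router row."""
        W_down = _rand_matrix(self.d_bottleneck, self.d_model, self.rng)
        W_up = _rand_matrix(self.d_model, self.d_bottleneck, self.rng)
        self.experts.append((W_down, W_up))
        self.router.append([self.rng.gauss(0.0, 0.1) for _ in range(self.d_model)])

    def gate(self, x):
        """Softmax over router logits; the input alone decides, no task index."""
        logits = _matvec(self.router, x)
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def forward(self, x):
        """Gate-weighted sum of expert outputs (dense routing, an assumption)."""
        gates = self.gate(x)
        out = [0.0] * self.d_model
        for g, (W_down, W_up) in zip(gates, self.experts):
            hidden = [max(0.0, h) for h in _matvec(W_down, x)]  # ReLU bottleneck
            y = _matvec(W_up, hidden)
            out = [o + g * yi for o, yi in zip(out, y)]
        return out


layer = DynamicMoELayer(d_model=8, d_bottleneck=2, n_experts=2)
x = [0.5] * 8
y_before = layer.forward(x)
layer.add_expert()  # in a training loop this would be called periodically
y_after = layer.forward(x)
```

In this sketch each new expert adds only `2 * d_model * d_bottleneck` expert weights plus one router row of `d_model` weights, which is the sense in which a narrow bottleneck keeps growth cheap, and the output has length `d_model` before and after expansion. A sparse top-k router could replace the dense softmax without changing the growth logic.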
