Dynamic Mixture of Experts Against Severe Distribution Shifts

Donghu Kim

TL;DR

The paper tackles continual learning under evolving data distributions, aiming to preserve plasticity without incurring catastrophic forgetting. It introduces DynamicMoE, which grows capacity by periodically adding bottlenecked experts to MoE layers, without relying on explicit task indices and while maintaining a fixed input/output footprint. Empirical results in synthetic continual learning and open-world RL show that DynamicMoE preserves its early performance and can surpass full-capacity baselines, which is attributed to better allocation of new capacity to shifting distributions; router dynamics reveal distribution-specific expert specialization. The work presents dynamic MoE as a parameter-efficient path to lifelong learning and offers insight into how router-based specialization supports continual adaptation.

Abstract

The challenge of building neural networks that can continuously learn and adapt to evolving data streams is central to the fields of continual learning (CL) and reinforcement learning (RL). This lifelong learning problem is often framed in terms of the plasticity-stability dilemma, focusing on issues like loss of plasticity and catastrophic forgetting. Unlike artificial neural networks, biological brains maintain plasticity through capacity growth, which has inspired researchers to explore similar approaches in artificial networks, such as adding capacity dynamically. Prior solutions often lack parameter efficiency or depend on explicit task indices, but Mixture-of-Experts (MoE) architectures offer a promising alternative by specializing experts for distinct distributions. This paper evaluates a DynamicMoE approach in continual and reinforcement learning environments and benchmarks its effectiveness against existing network expansion methods.
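
The growth mechanism summarized in the TL;DR (adding bottlenecked experts to an MoE layer while its input/output footprint stays fixed) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The class name `GrowableMoE`, the dense softmax routing, the ReLU bottleneck, and the zero initialization of a new expert's router weights are all assumptions made for illustration; the paper may use a different router, a different activation, or top-k routing.

```python
import math
import random


def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n x m)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


class GrowableMoE:
    """Hypothetical mixture of bottlenecked experts that can grow over time."""

    def __init__(self, d_in, d_hidden, d_out, seed=0):
        self.d_in, self.d_hidden, self.d_out = d_in, d_hidden, d_out
        self.rng = random.Random(seed)
        self.experts = []  # list of (w1, w2) weight pairs, biases omitted
        self.router = []   # one routing weight vector (length d_in) per expert
        self.add_expert()

    def add_expert(self):
        """Append one bottlenecked expert: d_in -> d_hidden -> d_out."""
        g = self.rng.gauss
        w1 = [[g(0.0, 0.1) for _ in range(self.d_hidden)] for _ in range(self.d_in)]
        w2 = [[g(0.0, 0.1) for _ in range(self.d_out)] for _ in range(self.d_hidden)]
        self.experts.append((w1, w2))
        # Assumed initialization: the new expert starts with a zero router logit.
        self.router.append([0.0] * self.d_in)

    def route(self, x):
        """Dense softmax over per-expert logits (an assumption, not top-k)."""
        logits = [sum(xi * wi for xi, wi in zip(x, w)) for w in self.router]
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def forward(self, x):
        """Gate-weighted sum of expert outputs; output size is always d_out."""
        gates = self.route(x)
        out = [0.0] * self.d_out
        for gate, (w1, w2) in zip(gates, self.experts):
            hidden = [max(0.0, v) for v in matvec(x, w1)]  # ReLU bottleneck
            y = matvec(hidden, w2)
            out = [o + gate * yi for o, yi in zip(out, y)]
        return out

    def num_expert_params(self):
        """Expert weights only, ignoring router parameters and biases."""
        return len(self.experts) * self.d_hidden * (self.d_in + self.d_out)


moe = GrowableMoE(d_in=4, d_hidden=2, d_out=3)
x = [1.0, 0.5, -0.5, 2.0]
before = moe.forward(x)
moe.add_expert()  # capacity grows; input and output sizes do not change
after = moe.forward(x)
print(len(before), len(after), moe.num_expert_params())
```

In this sketch, growing the layer changes only the number of experts and router rows, so any code that consumes the layer's output is unaffected by the expansion.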

Paper Structure

This paper contains 5 sections, 4 figures.

Figures (4)

  • Figure 1: An example diagram comparing (a) Linear layer, (b) Mixture of linear experts, (c) MLP with bottleneck layer, and (d) Mixture of bottlenecked experts. Ignoring router parameters and biases, all four networks have the same number of parameters.
  • Figure 2: Training accuracy of network expansion methods throughout the synthetic continual learning setup. No expansion starts with its maximal capacity, which explains its strong initial performance. However, it loses trainability, as shown by its degrading performance. In contrast, DynamicMoE(g2, g4) maintains its initial performance until the final task. The other expansion methods do not exhibit this property.
  • Figure 3: 1B Craftax results. We visualize the average performance and standard deviation over 3 random seeds. 1-to-2 grow starts from the checkpoint of the 1 expert run, so that the divergence between adding a new expert and not adding one can be seen directly.
  • Figure 4: (Left) Router weight visualization of the first MoE layer in the critic. As the agent enters the second stage (dungeon) at $t=221$, the router weight promptly shifts from the old expert to the new one. (Right) Observation of Craftax as the agent transitions from the first stage to the second stage.
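
The Figure 1 caption states that, ignoring router parameters and biases, the compared networks have the same number of parameters. The short calculation below shows one set of dimensions for which this holds for a linear layer, a bottlenecked MLP, and a mixture of bottlenecked experts. The dimensions (8 inputs, 8 outputs, bottleneck width 4, 2 experts) are assumptions chosen for illustration and are not taken from the paper. The mixture of linear experts in panel (b) is left out because the caption does not say how it is constructed.

```python
def linear_params(d_in, d_out):
    """Weights of a single linear layer, biases ignored."""
    return d_in * d_out


def bottleneck_mlp_params(d_in, d_hidden, d_out):
    """Weights of a two-layer MLP with a bottleneck of width d_hidden."""
    return d_in * d_hidden + d_hidden * d_out


def bottleneck_moe_params(d_in, d_hidden_per_expert, d_out, n_experts):
    """Expert weights of a mixture of bottlenecked experts, router ignored."""
    return n_experts * bottleneck_mlp_params(d_in, d_hidden_per_expert, d_out)


# Assumed example dimensions (not from the paper).
d_in, d_out = 8, 8
d_hidden = 4
n_experts = 2

counts = {
    "linear": linear_params(d_in, d_out),
    "bottleneck_mlp": bottleneck_mlp_params(d_in, d_hidden, d_out),
    # Splitting the bottleneck evenly across experts keeps the total fixed.
    "bottleneck_moe": bottleneck_moe_params(
        d_in, d_hidden // n_experts, d_out, n_experts
    ),
}
print(counts)
```

Under these assumptions, splitting a bottleneck of width 4 into two experts of width 2 leaves the total weight count unchanged, since each expert costs `d_hidden_per_expert * (d_in + d_out)` weights.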