
On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models

Chongyang Zhao, Mingsong Li, Haodong Lu, Dong Gong

Abstract

Multimodal Continual Instruction Tuning aims to continually enhance Large Vision Language Models (LVLMs) by learning from new data without forgetting previously acquired knowledge. Mixture of Experts (MoE) architectures naturally facilitate this by incrementally adding new experts and expanding routers while keeping the existing ones frozen. However, despite expert isolation, MoE-based continual learners still suffer from forgetting due to routing-drift: old-task tokens become mistakenly attracted to newly added experts, degrading performance on prior tasks. We analyze the failure mode at the token level and reveal the token's dilemma: ambiguous and old tokens in new-task data offer minimal learning benefit yet induce forgetting when routed to new experts, due to their ambiguous routing assignment during training. Motivated by this, we propose LLaVA-DyMoE, a dynamic MoE framework that incrementally expands the MoE with drift-aware token assignment. We characterize token types via their routing score distributions and apply targeted regularization. Specifically, a token-level assignment guidance steers ambiguous and old tokens away from new experts to preserve established routing patterns and alleviate routing-drift, while complementary routing score regularizations enforce expert-group separation and promote new-expert specialization. Extensive experiments demonstrate that our LLaVA-DyMoE effectively mitigates routing-drift-induced forgetting, achieving over a 7% gain in mean final accuracy and a 12% reduction in forgetting compared to baselines. The project page is https://zhaoc5.github.io/DyMoE.

Paper Structure

This paper contains 31 sections, 8 equations, 5 figures, 16 tables.

Figures (5)

  • Figure 1: Performance comparison on the CoIN benchmark, showing per-task final accuracy and mean final accuracy (MFN).
  • Figure 2: Routing-drift analysis in a controlled two-task learning experiment. After learning the 1st task (SQA), we train on the 2nd task (TextVQA) using the baseline (default training) and three token-masking strategies based on token type. Throughout training, we evaluate forgetting (the decrease in task-1 accuracy) and new-task learning (the improvement in task-2 accuracy). Polynomial-regression-fitted curves are used for better visualization and readability of the performance changes. The baseline (default training) refers to a scenario where each input token is assigned to all experts (both the old frozen and the new learnable ones). We then examine the role of each token group based on its routing scores: (a) we retain only the contribution of tokens with high affinity to the new expert group (termed "new tokens"); (b) we mask out the contribution of tokens with high affinity to the old expert group (termed "old tokens"); (c) we retain only the contribution of tokens with a small affinity difference between the old and new expert groups (termed "ambiguous tokens"). An illustrative sketch of this token grouping is given after this figure list.
  • Figure 3: Overview of our LLaVA-DyMoE method, which applies a dynamic MoE with LoRA experts to each layer of the language backbone in LLaVA. It is a two-fold regularization approach designed to resolve routing-drift-induced forgetting, based on our analysis of different token types in Sec. \ref{sec:analyze}. The right panel illustrates the high-level approach: as a new task (Task $t$) arrives, the router and experts expand, creating a frozen "old group" and a trainable "new group". Our Token Assignment Guidance (TAG) prevents routing-drift (red dashed arrow) by directing tokens to the appropriate expert–router groups, complemented by our Routing Score Regularization (RSR), which encourages exclusive token-to-group routing and new-expert specialization (see the second sketch after this list). Our method regularizes the router behavior during training and imposes no constraints at inference, allowing seamless combination with other continual learning methods.
  • Figure 4: Layer-wise expert activation on the CoIN benchmark. Activation frequency is shown for each expert group across layers, and circle size reflects how often an expert is activated.
  • Figure 5: Comparisons between the baseline IncMoELoRA and LLaVA-DyMoE on example cases after training on the final task. The first column shows cases from ScienceQA; the second column shows cases from ImageNet.
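
The token grouping described in the Figure 2 caption can be pictured as a simple thresholding of per-group routing affinities. The following is a minimal sketch, not the paper's exact formulation: the function name `classify_tokens`, the sum-of-scores group affinity, and the threshold `tau` are illustrative assumptions.

```python
import torch

def classify_tokens(routing_scores: torch.Tensor,
                    num_old_experts: int,
                    tau: float = 0.1):
    """Group tokens by their affinity to the old vs. new expert groups.

    routing_scores: (num_tokens, num_experts) softmax outputs of the expanded
    router, where the first `num_old_experts` columns belong to the frozen old
    group and the remaining columns to the newly added group.
    Returns boolean masks for "old", "new", and "ambiguous" tokens.
    (Illustrative assumption: affinity = total routing mass per group.)
    """
    old_affinity = routing_scores[:, :num_old_experts].sum(dim=-1)
    new_affinity = routing_scores[:, num_old_experts:].sum(dim=-1)

    diff = new_affinity - old_affinity
    ambiguous_tokens = diff.abs() < tau   # no clear group preference
    new_tokens = diff >= tau              # clearly drawn to the new group
    old_tokens = diff <= -tau             # clearly drawn to the old group
    return old_tokens, new_tokens, ambiguous_tokens
```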
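Building on that grouping, the Token Assignment Guidance in Figure 3 can be thought of as a training-time constraint on the router. The sketch below, with the hypothetical `guide_token_assignment` helper, shows one way such steering could be realized as hard logit masking; the paper's TAG may instead use softer regularization, and no constraint is applied at inference.

```python
import torch

def guide_token_assignment(router_logits: torch.Tensor,
                           num_old_experts: int,
                           old_tokens: torch.Tensor,
                           ambiguous_tokens: torch.Tensor) -> torch.Tensor:
    """Steer old and ambiguous tokens away from the new expert group (training only).

    router_logits: (num_tokens, num_experts) pre-softmax router outputs,
    with the newly added experts occupying the last columns.
    old_tokens / ambiguous_tokens: boolean masks from the token grouping above.
    """
    guided = router_logits.clone()
    steer_away = old_tokens | ambiguous_tokens
    # Suppressing the new-group logits for these tokens keeps their routing
    # confined to the frozen old experts, so the new experts are updated only
    # by tokens that clearly belong to the new task.
    guided[steer_away, num_old_experts:] = torch.finfo(guided.dtype).min
    return guided
```

In this picture, the guided logits replace the raw ones before the softmax and top-k selection during new-task training, which is one way to keep old-task routing patterns intact while letting the new experts specialize.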