ModuleFormer: Modularity Emerges from Mixture-of-Experts

Yikang Shen; Zheyu Zhang; Tianyou Cao; Shawn Tan; Zhenfang Chen; Chuang Gan

TL;DR

ModuleFormer introduces a modular Transformer architecture that induces modularity from uncurated data using Sparse Mixture of Experts. It combines SMoE-based feedforward modules with stick-breaking self-attention heads and novel losses (Mutual Information load balancing and Load Concentration) to enable efficient sparse routing, continual learning with new modules, and pruning for lightweight deployment. Empirical results show comparable performance to dense LLMs with roughly half the latency and substantially lower memory, along with improved extendability and resistance to catastrophic forgetting. The work demonstrates effective continual learning and pruning workflows, highlighting practical benefits for adapting large models to new languages or domains without full fine-tuning. Overall, ModuleFormer offers a pathway to scalable, updatable, and efficient LLMs through modularity and targeted specialization.

Abstract

Large Language Models (LLMs) have achieved remarkable results. However, existing models are expensive to train and deploy, and it is difficult to expand their knowledge beyond the pre-training data without forgetting previous knowledge. This paper proposes a new neural network architecture, ModuleFormer, that leverages modularity to improve the efficiency and flexibility of large language models. ModuleFormer is based on the Sparse Mixture of Experts (SMoE). Unlike the previous SMoE-based modular language model, which requires domain-labeled data to learn domain-specific experts, ModuleFormer can induce modularity from uncurated data with its new load balancing and concentration losses. ModuleFormer is a modular architecture that includes two different types of modules: new stick-breaking attention heads and feedforward experts. Modules are sparsely activated, conditioned on the input token, during both training and inference. In our experiments, we found that the modular architecture enables three important abilities for large pre-trained language models: 1) Efficiency: because ModuleFormer activates only a subset of its modules for each input token, it can achieve the same performance as dense LLMs with more than twice the throughput; 2) Extendability: ModuleFormer is more immune to catastrophic forgetting than dense LLMs and can be easily extended with new modules to learn knowledge that is not included in the training data; 3) Specialization: finetuning ModuleFormer can specialize a subset of modules to the finetuning task, and the task-unrelated modules can be easily pruned for a lightweight deployment.
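The sparse activation described in the abstract can be illustrated with a minimal top-k routing sketch. This is not the authors' implementation: the function names (`top_k_route`, `sparse_moe`), the renormalization of gate weights over the selected experts, and the use of plain Python lists in place of tensors are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts.

    Assumption: gate weights are renormalized so that they sum to 1 over
    the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def sparse_moe(token, experts, router_logits, k=2):
    """Combine the outputs of the selected experts only; the rest are never run."""
    out = [0.0] * len(token)
    for idx, gate in top_k_route(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out
```

With four experts and k=2, only two expert functions are evaluated per token, which is where the compute saving relative to a dense layer comes from.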

Paper Structure (39 sections, 11 equations, 5 figures, 10 tables)

Introduction
Related Work
Model Architecture
Preliminary: Mixture of Experts
Stick-breaking Self-Attention head
Module Manipulation
Load Balancing during Pretraining
Load Concentration during Finetuning
Inserting new Modules for Continual Learning
Experiments
Language Modeling
Pretraining
Evaluation Settings
Results
Ablation Study
...and 24 more sections

Figures (5)

Figure 1: The architecture of ModuleFormer. The sparse activation schema enables high computational efficiency. Adding new modules amounts to inserting randomly initialized ones into each layer and then training the new experts and the router on a new dataset. The number and size of new modules can be customized to accommodate different scenarios. Pruning modules involves counting the activation frequency of each module and setting a threshold to remove the least used modules. The pruning percentage can also be customized to achieve a trade-off between performance and model size.
Figure 2: KL-divergence between different domains of the Pile test set. We collected expert activation frequencies for the MLP experts of MoLM-4B-K2 and our ST-MoE baseline on different domains of the Pile test set. We computed the KL-divergence between domains from these expert distributions for both MoLM-4B-K2 and the ST-MoE baseline. A lower KL-divergence means that two domains have similar expert distributions.
Figure 3: Performance after Pruning on the HumanEval Dataset. f-p is MoLM-4B-K2 finetuned with the load concentration loss and then pruned by expert frequency. p-f is MoLM-4B-K2 pruned by expert frequency first and then finetuned on a Python corpus. uni and top are MoLM-4B-K2 pruned by uniformly dropping layers or by dropping the top layers, respectively.
Figure 4: Ablation study for the load concentration loss. f-p_noaux is MoLM-4B-K2 finetuned without the load concentration loss and then pruned by expert frequency.
Figure 5: Performance of different pruning methods. Max means that the expert frequency is normalized by the maximum frequency inside each layer. Sum means that the expert frequency is normalized by the total activation frequency of each layer.
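Two quantities mentioned in the captions above, the frequency-based pruning rule (Figures 1 and 5) and the KL-divergence between expert activation distributions (Figure 2), can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: the helper names, the `>=` threshold comparison, the additive smoothing constant `eps`, and the direction of the KL-divergence are hypothetical choices that the captions do not specify.

```python
import math

def normalize(counts, mode="max"):
    """Normalize per-layer expert counts by the layer maximum ("max") or total ("sum")."""
    denom = max(counts) if mode == "max" else sum(counts)
    if denom == 0:
        return [0.0] * len(counts)
    return [c / denom for c in counts]

def experts_to_keep(layer_counts, threshold, mode="max"):
    """For each layer, return the indices of experts whose normalized
    activation frequency reaches the threshold (assumed comparison: >=)."""
    keep = []
    for counts in layer_counts:
        scores = normalize(counts, mode)
        keep.append([i for i, s in enumerate(scores) if s >= threshold])
    return keep

def to_distribution(counts, eps=1e-8):
    """Turn activation counts into a probability distribution.
    Assumption: a small eps is added so that unused experts do not produce log(0)."""
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [s / total for s in smoothed]

def kl_divergence(p_counts, q_counts):
    """KL(P || Q) between the expert distributions of two domains."""
    p = to_distribution(p_counts)
    q = to_distribution(q_counts)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

For example, `experts_to_keep([[90, 10, 0, 50]], 0.5, mode="max")` keeps experts 0 and 3 in that layer, since their counts are at least half of the layer maximum. Under Max normalization the most-used expert in every layer always survives, whereas under Sum normalization a layer with evenly spread activations can lose all of its experts at a high threshold.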
