Unlocking Emergent Modularity in Large Language Models
Zihan Qiu, Zeyu Huang, Jie Fu
TL;DR
This work unlocks the emergent modularity (EM) of pre-trained language models by externalizing it as Emergent MoEs (EMoE). EMoE layers are constructed from existing FFN keys through constrained clustering and avg-k gating, which requires no extra parameters. They are fine-tuned with standard methods (e.g., LoRA) and yield stronger in-domain and out-of-domain generalization across multiple backbones and scales, including Llama2-7B and Llama-30B, while remaining deployable as standard models after training. Through analysis and ablations, the paper shows that EMoE reveals task-specific modular activation, that it mitigates negative transfer during fine-tuning, and that the improvements come largely from training-time effects rather than inference-time architecture changes. The approach offers a practical way to exploit emergent modularity in LMs without increasing parameter counts, with implications for robust, scalable modular neural design in the LLM era.
Abstract
Modular Neural Networks (MNNs) demonstrate various advantages over monolithic models. Existing MNNs are generally $\textit{explicit}$: their modular architectures are pre-defined, with individual modules expected to implement distinct functions. Recent works reveal that standard pre-trained transformers also contain $\textit{implicit}$ modularity, namely $\textit{Emergent Modularity}$, and indicate that such modular structures emerge spontaneously during the early pre-training phase. Despite the benefits of modularity, most Language Models (LMs) are still treated as monolithic models in the pre-train and fine-tune paradigm, with their emergent modularity locked and underutilized. In this work, focusing on unlocking the emergent modularity in LMs, we show that standard LMs can be fine-tuned as their Mixture-of-Experts (MoE) counterparts without introducing any extra parameters. Such MoEs are derived from emergent modularity and are referred to as Emergent MoEs (EMoE). Our experiments demonstrate that fine-tuning EMoE effectively improves downstream in-domain and out-of-domain generalization compared with vanilla fine-tuning. Our analysis and ablation studies further illustrate that EMoE is robust to various configurations and can scale up to Large Language Models (i.e., Llama2-7B and Llama-30B). Code is available at https://github.com/qiuzh20/EMoE.
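The construction summarized in the TL;DR (experts built from existing FFN keys, plus a gate that adds no parameters) can be illustrated with a small sketch. This is not the authors' implementation; see the linked repository for that. It rests on several assumptions: the FFN is written as a key-value memory with a ReLU activation, the constrained clustering is replaced by a simple greedy equal-size assignment, and avg-k gating is read as scoring each expert by the input's dot product with the mean of that expert's keys, then keeping the top-k experts. All function names (`dense_ffn`, `balanced_groups`, `avg_k_gate`, `emoe_ffn`) are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_ffn(x, keys, values):
    # FFN viewed as key-value memory: y = sum_i relu(x . k_i) * v_i
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        act = max(0.0, dot(x, k))
        out = [o + act * vj for o, vj in zip(out, v)]
    return out


def balanced_groups(keys, num_experts):
    # Assumption: a greedy stand-in for constrained clustering.
    # Each key goes to the nearest seed centroid that still has capacity,
    # so every expert ends up with the same number of keys.
    assert len(keys) % num_experts == 0
    capacity = len(keys) // num_experts
    centroids = [keys[e * capacity] for e in range(num_experts)]
    groups = [[] for _ in range(num_experts)]
    for i, k in enumerate(keys):
        by_distance = sorted(
            range(num_experts),
            key=lambda e: sum((a - b) ** 2 for a, b in zip(k, centroids[e])),
        )
        for e in by_distance:
            if len(groups[e]) < capacity:
                groups[e].append(i)
                break
    return groups


def avg_k_gate(x, keys, groups, top_k):
    # Assumed reading of avg-k gating: the gate vector of an expert is the
    # mean of its keys, so no new parameters are introduced.
    scores = []
    for g in groups:
        mean_key = [sum(keys[i][d] for i in g) / len(g) for d in range(len(x))]
        scores.append(dot(x, mean_key))
    ranked = sorted(range(len(groups)), key=lambda e: scores[e], reverse=True)
    return ranked[:top_k]


def emoe_ffn(x, keys, values, groups, top_k):
    # Only the keys and values of the selected experts contribute.
    selected = avg_k_gate(x, keys, groups, top_k)
    idx = [i for e in selected for i in groups[e]]
    return dense_ffn(x, [keys[i] for i in idx], [values[i] for i in idx])


random.seed(0)
d_model, d_ff, num_experts = 4, 8, 4
keys = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
values = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_ff)]
x = [random.gauss(0, 1) for _ in range(d_model)]

groups = balanced_groups(keys, num_experts)
sparse_out = emoe_ffn(x, keys, values, groups, top_k=2)
full_out = emoe_ffn(x, keys, values, groups, top_k=num_experts)
dense_out = dense_ffn(x, keys, values)
```

Because the experts only partition the existing keys and values, selecting every expert reproduces the dense FFN output up to floating-point rounding. In this toy setting, that is the sense in which the sparse layer is a regrouping of the original weights rather than a new architecture.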
