Unlocking Emergent Modularity in Large Language Models

Zihan Qiu, Zeyu Huang, Jie Fu

TL;DR

This work unlocks the emergent modularity (EM) of pre-trained language models by externalizing it as Emergent MoEs (EMoE). Experts are constructed from the keys of existing FFN layers through constrained clustering, and inputs are routed by avg-k gating, which uses each expert's average key as its gating weight and therefore requires no extra parameters. EMoE layers are fine-tuned with standard methods (e.g., LoRA) and yield stronger in-domain and out-of-domain generalization across multiple backbones and scales, including Llama2-7B and Llama-30B, while remaining deployable as standard models after training. Analysis and ablations show that EMoE reveals task-specific modular activation and mitigates negative transfer during fine-tuning, and that the improvements come largely from training-time effects rather than inference-time architecture changes. The approach offers a practical way to exploit the emergent modularity of LMs without increasing parameter counts.

Abstract

Modular Neural Networks (MNNs) demonstrate various advantages over monolithic models. Existing MNNs are generally $\textit{explicit}$: their modular architectures are pre-defined, with individual modules expected to implement distinct functions. Recent works reveal that there exists $\textit{implicit}$ modularity in standard pre-trained transformers, namely $\textit{Emergent Modularity}$. They indicate that such modular structures emerge spontaneously during the early pre-training phase. Despite the benefits of modularity, most Language Models (LMs) are still treated as monolithic models in the pre-train and fine-tune paradigm, with their emergent modularity locked and underutilized. In this work, focusing on unlocking the emergent modularity in LMs, we show that standard LMs can be fine-tuned as their Mixture-of-Experts (MoE) counterparts without introducing any extra parameters. Such MoEs are derived from emergent modularity and are referred to as Emergent MoEs (EMoE). Our experiments demonstrate that fine-tuning EMoE effectively improves downstream in-domain and out-of-domain generalization compared with vanilla fine-tuning. Our analysis and ablation studies further illustrate that it is robust to various configurations and can scale up to Large Language Models (i.e., Llama2-7B and Llama-30B). Code is available at https://github.com/qiuzh20/EMoE.

Paper Structure (29 sections, 6 equations, 8 figures, 26 tables)

This paper contains 29 sections, 6 equations, 8 figures, 26 tables.

Figures (8)

  • Figure 1: (a) Existing literature (Geva et al., 2021; Geva et al., 2022) suggests that the FFNs in transformers can be viewed as key-value memories: the input acts as a query, the first layer as keys, and the second layer as values. Given an input, keys are sparsely activated (marked in red), so most of the values don't impact the output. (b) The FFN block can be partitioned into experts by clustering keys. (c) Afterward, each expert's key average is used as its gating weight. The inner product between $\textbf{x}$ and the gating weights is used to select experts.
  • Figure 2: Left: Activations of neurons (z-axis denotes activation value) in the FFNs of a pre-trained transformer model. Middle: By clustering the keys in an FFN layer and rearranging the activation scores accordingly, modular patterns of neuron activation emerge. Right: The heat map between experts and tasks. The activation of experts is task-dependent, while some experts are shared across different tasks.
  • Figure 3: ID and OOD accuracies compared with LoRA for validating EMoE's training & inference effects.
  • Figure 4: Sparsely activated training accuracies with different expert selections.
  • Figure 5: Expert selections during training with distinct gating functions ($\operatorname{avg-k}$ vs. learned) and expert types (splits of FFNs vs. copies of FFNs). The vertical axis shows training steps (top to bottom corresponds to the beginning to the end of training); the horizontal axis represents expert selection frequency within 1K steps (deeper color implies a higher frequency). (a), (b) and (c) correspond to EMoE, EMoE-learn, and GMoE.
  • ...and 3 more figures
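
The construction described in the Figure 1 caption (keys clustered into experts, key averages used as gating weights, inner products used to select experts) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`balanced_cluster`, `select_experts`, `emoe_forward`), the greedy equal-size clustering used as a stand-in for constrained clustering, the bias-free two-layer ReLU FFN, and the toy dimensions are all assumptions made for illustration.

```python
import numpy as np


def balanced_cluster(keys, n_experts, n_iter=10, seed=0):
    """Greedy equal-size k-means (a stand-in for constrained clustering).

    keys: (h, d) array, one row per FFN key. Returns an (h,) array of
    expert ids in which every expert owns exactly h // n_experts keys.
    """
    rng = np.random.default_rng(seed)
    n = keys.shape[0]
    assert n % n_experts == 0, "sketch assumes keys split evenly"
    cap = n // n_experts
    centers = keys[rng.choice(n, size=n_experts, replace=False)]
    assign = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        # Squared distance from every key to every center: (h, n_experts).
        dist = ((keys[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        counts = np.zeros(n_experts, dtype=int)
        # Place the keys closest to some center first; skip full clusters.
        for i in np.argsort(dist.min(axis=1)):
            for c in np.argsort(dist[i]):
                if counts[c] < cap:
                    assign[i] = c
                    counts[c] += 1
                    break
        centers = np.stack(
            [keys[assign == c].mean(axis=0) for c in range(n_experts)]
        )
    return assign


def select_experts(x, K, assign, n_experts, top_k):
    """Average-key gating: score each expert by <x, mean of its keys>."""
    gates = np.stack([K[assign == e].mean(axis=0) for e in range(n_experts)])
    scores = gates @ x                    # (n_experts,)
    return np.argsort(scores)[-top_k:]    # ids of the top_k experts


def emoe_forward(x, K, V, assign, n_experts, top_k):
    """FFN forward pass that keeps only neurons of the selected experts."""
    chosen = select_experts(x, K, assign, n_experts, top_k)
    mask = np.isin(assign, chosen)        # (h,) boolean neuron mask
    act = np.maximum(K @ x, 0.0)          # ReLU over key activations
    return (act * mask) @ V               # weighted sum of kept values


# Toy example with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
d, h, E = 8, 32, 4
K = rng.normal(size=(h, d))   # first FFN layer: keys
V = rng.normal(size=(h, d))   # second FFN layer: values
x = rng.normal(size=d)

assign = balanced_cluster(K, E)
dense = np.maximum(K @ x, 0.0) @ V
full = emoe_forward(x, K, V, assign, E, top_k=E)
sparse = emoe_forward(x, K, V, assign, E, top_k=2)
```

In this sketch the gating weights are computed from the existing keys, so no parameters are added, and selecting all experts (`top_k=E`) reproduces the dense FFN output exactly. That is one way to read the TL;DR's statement that the model remains deployable as a standard model after training.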