MoEfication: Transformer Feed-forward Layers are Mixtures of Experts

Zhengyan Zhang; Yankai Lin; Zhiyuan Liu; Peng Li; Maosong Sun; Jie Zhou

TL;DR

MoEfication converts a pre-trained Transformer into a mixture-of-experts (MoE) model by splitting the parameters of each FFN into multiple experts and adding routers that select which experts are active for each input. It achieves similar task performance while using only 10–30% of FFN parameters, enabling up to 2x CPU speedups and 1.2x GPU speedups due to reduced FLOPs. The approach also provides a fine-grained interpretability signal for FFN computations, enabling closer study of how linguistic and factual knowledge is stored in FFNs. The authors demonstrate generality across models (including BERT-Large) and offer practical hyperparameter guidance and comparisons to pruning, with code released at the project homepage.

Abstract

Recent work has shown that feed-forward networks (FFNs) in pre-trained Transformers are a key component, storing various linguistic and factual knowledge. However, the computational patterns of FFNs are still unclear. In this work, we study the computational patterns of FFNs and observe that most inputs activate only a tiny fraction of FFN neurons. This phenomenon is similar to the sparsity of the human brain, which drives research on functional partitions of the human brain. To verify whether functional partitions also emerge in FFNs, we propose to convert a model into its MoE version with the same parameters, namely MoEfication. Specifically, MoEfication consists of two phases: (1) splitting the parameters of FFNs into multiple functional partitions as experts, and (2) building expert routers to decide which experts will be used for each input. Experimental results show that MoEfication can conditionally use 10% to 30% of FFN parameters while maintaining over 95% of the original performance for different models on various downstream tasks. Besides, MoEfication brings two advantages: (1) it significantly reduces the FLOPs of inference, i.e., a 2x speedup with 25% of FFN parameters, and (2) it provides a fine-grained perspective to study the inner mechanism of FFNs. The source code of this paper can be obtained from https://github.com/thunlp/MoEfication.
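
The two phases described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the contiguous neuron split, the oracle router that scores experts by their summed activations, and all function names (`dense_ffn`, `split_into_experts`, `moe_ffn`) are assumptions made for illustration, and the splitting and routing strategies actually used in the paper are not described in this section. The sketch only shows that an FFN's hidden neurons can be partitioned into experts without changing its parameters, and that evaluating a subset of experts approximates the dense output.

```python
import numpy as np


def dense_ffn(x, W1, b1, W2, b2):
    """Standard FFN: ReLU(x W1 + b1) W2 + b2."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2


def split_into_experts(d_ff, n_experts):
    """Phase 1 (assumed strategy): assign hidden neurons to experts in contiguous blocks."""
    return np.array_split(np.arange(d_ff), n_experts)


def moe_ffn(x, W1, b1, W2, b2, experts, n_active):
    """Phase 2 (assumed strategy): evaluate only the n_active best experts per input row."""
    # Oracle scores: total post-ReLU activation of each expert. Computing them
    # needs the full hidden layer, so this router is for illustration only;
    # a practical router has to be much cheaper than the FFN itself.
    h = np.maximum(x @ W1 + b1, 0.0)
    scores = np.stack([h[:, idx].sum(axis=1) for idx in experts], axis=1)
    chosen = np.argsort(-scores, axis=1)[:, :n_active]

    out = np.tile(b2, (x.shape[0], 1))
    for row in range(x.shape[0]):
        for e in chosen[row]:
            idx = experts[e]
            # Each expert owns a slice of W1's columns and the matching rows of W2.
            h_e = np.maximum(x[row] @ W1[:, idx] + b1[idx], 0.0)
            out[row] += h_e @ W2[idx, :]
    return out


# Toy dimensions, chosen arbitrarily for the demonstration.
rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 32, 4
W1, b1 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_model)
x = rng.normal(size=(5, d_model))

experts = split_into_experts(d_ff, n_experts)
dense_out = dense_ffn(x, W1, b1, W2, b2)
full_out = moe_ffn(x, W1, b1, W2, b2, experts, n_active=n_experts)
sparse_out = moe_ffn(x, W1, b1, W2, b2, experts, n_active=1)
```

With every expert active, `full_out` matches `dense_out` up to floating-point error, since the experts together hold exactly the original parameters. With `n_active=1`, only a quarter of the hidden neurons are evaluated per input; on random weights the approximation is poor, and the abstract's claim is that in pre-trained models the activations are sparse enough for a small subset of experts to preserve most of the performance.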
