Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion

Filip Szatkowski, Bartosz Wójcik, Mikołaj Piórczyński, Simone Scardapane

TL;DR

The proposed method, Dense to Dynamic-$k$ Mixture-of-Experts (D2DMoE), outperforms existing approaches on common NLP and vision tasks, reducing inference cost by up to 60% without significantly impacting performance.

Abstract

Transformer models can face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers. Despite the crucial role played by activation sparsity, its impact on this process remains unexplored. We demonstrate that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model. Moreover, motivated by the high variance of the number of activated neurons for different inputs, we introduce a more effective dynamic-$k$ expert selection rule that adjusts the number of executed experts on a per-token basis. To achieve further savings, we extend this approach to multi-head attention projections. Finally, we develop an efficient implementation that translates these computational savings into actual wall-clock speedup. The proposed method, Dense to Dynamic-$k$ Mixture-of-Experts (D2DMoE), outperforms existing approaches on common NLP and vision tasks, reducing inference cost by up to 60% without significantly impacting performance.
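The per-token expert selection described above can be illustrated with a minimal sketch. The abstract does not state the exact selection rule, so the relative-threshold rule below (run every expert whose predicted contribution is at least `tau` times the largest predicted contribution for that token) is an assumption made for illustration. The names `dynamic_k_select`, `scores`, and `tau` are likewise hypothetical.

```python
def dynamic_k_select(scores, tau):
    """Pick experts per token from router-predicted contributions.

    scores: list of per-token lists of non-negative predicted contributions
            (one value per expert). Hypothetical router output.
    tau:    relative threshold in [0, 1]. Assumed rule, not taken from the source.
    Returns a list with, for each token, the indices of experts to execute.
    """
    selected = []
    for token_scores in scores:
        # The threshold is relative to the strongest expert for this token,
        # so the number of selected experts varies from token to token.
        threshold = tau * max(token_scores)
        selected.append([i for i, s in enumerate(token_scores) if s >= threshold])
    return selected


scores = [
    [0.90, 0.10, 0.05, 0.02],  # one dominant expert
    [0.50, 0.45, 0.40, 0.05],  # several comparably useful experts
]
selected = dynamic_k_select(scores, tau=0.5)
experts_per_token = [len(idx) for idx in selected]
print(selected)           # [[0], [0, 1, 2]]
print(experts_per_token)  # [1, 3]
```

With a fixed top-$k$ rule both tokens would run the same number of experts. Under this assumed rule the first token runs one expert and the second runs three, which is the kind of per-token adaptation the abstract motivates by the high variance in the number of activated neurons.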

Paper Structure (38 sections, 13 equations, 19 figures, 2 tables)

Figures (19)

  • Figure 1: Key components of D2DMoE: (a) We enhance the activation sparsity in the base model. (b) We convert FFN layers in the model to MoE layers with routers that predict the contribution of each expert. (c) We introduce dynamic-$k$ routing that selects the experts for execution based on their predicted contribution.
  • Figure 2: (a) Cost-accuracy tradeoff for a MoEfied (mirzadeh2023relu) GPT-2 model obtained starting from models with different levels of activation sparsity. Sparsification correlates with the model performance. (b) Distribution of non-zero activations in the FFN layers in GPT-2-base on OpenWebText, with and without the sparsity enforcement phase. Both models exhibit significant variance, and the mean-to-variance ratio increases in the sparsified model. (c) We propose to exploit the variation in activations through a dynamic-$k$ routing procedure that adapts the number of experts allocated to a sample.
  • Figure 3: Multi-Head Attention projection conversion scheme.
  • Figure 4: FLOPs-performance tradeoff comparison of our method and MoEfication (zhang2022moefication) on CV and NLP benchmarks. We also include early-exit (ZTW, wojcik2023zero) and token-dropping (A-ViT, yin2022vit) baselines for classification. Our method outperforms these baselines across multiple computational budgets.
  • Figure 5: Single D2DMoE layer execution wall-clock time.
  • ...and 14 more figures