Exploring the Benefit of Activation Sparsity in Pre-training

Zhengyan Zhang, Chaojun Xiao, Qiujieli Qin, Yankai Lin, Zhiyuan Zeng, Xu Han, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Jie Zhou

TL;DR

Switchable Sparse-Dense Learning (SSD) adaptively switches between Mixtures-of-Experts (MoE) based sparse training and conventional dense training during pre-training. It keeps the efficiency of sparse training while avoiding the static activation correlation that sparse training alone would impose.

Abstract

Pre-trained Transformers inherently possess the characteristic of sparse activation, where only a small fraction of the neurons are activated for each token. While sparse activation has been explored through post-training methods, its potential in pre-training remains untapped. In this work, we first study how activation properties change during pre-training. Our examination reveals that Transformers exhibit sparse activation throughout the majority of the pre-training process, while the activation correlation keeps evolving as training progresses. Leveraging this observation, we propose Switchable Sparse-Dense Learning (SSD). SSD adaptively switches between Mixtures-of-Experts (MoE) based sparse training and conventional dense training during pre-training, leveraging the efficiency of sparse training while avoiding its static activation correlation. Compared to dense training, SSD achieves comparable performance with identical model size and reduces pre-training costs. Moreover, models trained with SSD can be used directly as MoE models for sparse inference, matching the performance of dense models with up to $2\times$ faster inference. Code is available at https://github.com/thunlp/moefication.
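The switching idea in the abstract, together with the Figure 2 caption (monitor the activation pattern change at each checkpoint and convert to a sparse model once the pattern is stable), can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the Jaccard-distance measure of pattern change, the function names `pattern_change` and `should_switch_to_sparse`, and the `threshold` and `patience` parameters. The sketch covers only the dense-to-sparse decision; how SSD decides to return to dense training is not described in this excerpt.

```python
from typing import List, Set


def pattern_change(prev: List[Set[int]], curr: List[Set[int]]) -> float:
    """Mean Jaccard distance between per-token active-neuron sets.

    `prev` and `curr` hold, for the same probe tokens, the indices of the
    neurons that were active at two consecutive checkpoints.
    (Hypothetical metric, chosen only for illustration.)
    """
    if not prev or len(prev) != len(curr):
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for a, b in zip(prev, curr):
        union = a | b
        # Two empty sets are treated as identical (distance 0).
        total += 0.0 if not union else 1.0 - len(a & b) / len(union)
    return total / len(prev)


def should_switch_to_sparse(
    changes: List[float], threshold: float = 0.1, patience: int = 2
) -> bool:
    """Return True once the last `patience` checkpoint-to-checkpoint
    changes all fall below `threshold` (assumed stability rule)."""
    if len(changes) < patience:
        return False
    return all(c < threshold for c in changes[-patience:])
```

For example, a history of changes such as `[0.5, 0.08, 0.05]` would count as stable under these assumed defaults, while `[0.5, 0.2, 0.05]` would not, because only one of the last two changes is below the threshold.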

Paper Structure (21 sections, 4 equations, 7 figures, 7 tables)


Figures (7)

  • Figure 1: Activation sparsity and activation pattern change of three different models during pre-training.
  • Figure 2: Illustration of SSD. During dense training, we monitor the activation pattern change for each checkpoint and transform the model into an SMoE model when the activation pattern becomes stable. During sparse training, we only compute and update the parameters of selected experts for better efficiency.
  • Figure 3: Computational costs (FLOPs) of pre-training and perplexity (PPL) on the validation set of different methods on three representative models. Smaller computational cost means better efficiency and smaller perplexity means better performance.
  • Figure 4: Perplexity on the validation set with different computational costs and inference time by varying the number of selected experts. For comparison, we also scatter the results of SMoE with its default number of selected experts and dense models without any sparsity.
  • Figure 5: Perplexity on the validation set with different computational costs by truncating the experts with small importance scores.
  • ...and 2 more figures
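
Two quantities recur in the captions above: activation sparsity (Figure 1) and the number of selected experts (Figure 4). The sketch below shows one plausible way to compute each for a single token. It assumes a ReLU-style feed-forward layer in which a neuron counts as inactive when its pre-activation is not positive, and a router that simply keeps the k highest-scoring experts. The paper's exact definitions are not given in this excerpt, and the function names are hypothetical.

```python
from typing import List


def activation_sparsity(pre_activations: List[float]) -> float:
    """Fraction of neurons that are inactive for one token.

    Assumes ReLU: a neuron is inactive when its pre-activation is <= 0.
    """
    if not pre_activations:
        raise ValueError("need at least one neuron")
    inactive = sum(1 for x in pre_activations if x <= 0.0)
    return inactive / len(pre_activations)


def select_top_k_experts(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring experts, best first.

    Ties are broken in favour of the lower expert index.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of experts")
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]
```

With these definitions, a token whose pre-activations are `[-1.0, 0.0, 2.0, -3.0]` has sparsity 0.75, and router scores `[0.1, 0.7, 0.3, 0.7]` with `k = 2` select experts 1 and 3. Increasing `k` trades more computation for a model that behaves more like its dense counterpart, which is the axis Figure 4 varies.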